Abstract

Multiple testing problems arising in modern scientific applications can involve simultaneously
testing thousands or even millions of hypotheses, with relatively few true signals. In this
paper, we consider the multiple testing problem where prior information is available (for instance,
from an earlier study under different experimental conditions) that allows us to test the
hypotheses as a ranked list in order to increase the number of discoveries. Given an ordered list of
n hypotheses, the aim is to select a data-dependent cutoff k and declare the first k hypotheses
to be statistically significant while bounding the false discovery rate (FDR). Generalizing several
existing methods, we develop a family of “accumulation tests” to choose a cutoff k that adapts
to the amount of signal at the top of the ranked list. We introduce a new method in this family,
the HingeExp method, which offers higher power to detect true signals compared to existing
techniques. Our theoretical results prove that these methods control a modified FDR on finite
samples, and characterize the power of the methods in the family. We apply the tests to simulated
data, including a high-dimensional model selection problem for linear regression. We also
compare accumulation tests to existing methods for multiple testing on a real data problem of
identifying differential gene expression over a dosage gradient.

Keywords: false discovery rate, ordered hypothesis testing, sequential hypothesis testing,
accumulation test, power, multiple testing.


1 Introduction 

In many modern applications of statistics, the availability of high-dimensional data sets allows
simultaneous testing of a large list of potential hypotheses. This can lead to the well-known issue
of multiple testing: if each hypothesis is tested with a significance threshold $\alpha$, e.g. $\alpha = 0.05$,
then in the scenario where the number of true signals among the $n$ hypotheses is small, we can
expect approximately $\alpha \cdot n$ false discoveries. Many methods have been developed to handle this issue,
including the Benjamini-Hochberg procedure [3], where the threshold $\alpha$ is chosen in a way that adapts to
the amount of signal in the data, to control the proportion of false discoveries. Historically, many
methods have been developed that treat the $n$ hypotheses interchangeably, while in practice we may
often have additional information: for instance, a group structure on the set of hypotheses (where
we believe that the true signals are likely to cluster in these groups), a hierarchical structure (for
instance, in a regression problem, we might believe that an interaction effect between variables $X_j$
and $X_k$ should only be present if $X_j$ and $X_k$ each show marginal effects on the response as well),
or an ordering or partial ordering (where prior information leads us to believe that some hypotheses
are more likely to contain a true signal than others, or where we want to test hypotheses in a certain
order to respect the structure of the problem). In this work, we treat this last case.

To describe the problem more precisely, suppose that we are interested in testing $n$ hypotheses,
denoted $H_1, \dots, H_n$, and experimental data has yielded individual p-values for each of these
hypotheses, which we write as $p_1, \dots, p_n$. For a concrete example, we might be searching for a genetic
cause for some particular disease, in which case we might test a hypothesis $H_i$ that the $i$th SNP
in our experiment is associated with the disease, where the index $i = 1, \dots, n$ labels each of the
SNPs being tested. The $i$th p-value $p_i$ would be a measure of the association between the $i$th SNP
and the disease, and we assume a null distribution $p_i \sim \mathrm{Uniform}[0,1]$ whenever the $i$th SNP has no
association with the disease.

In many practical settings, the experiment that we would like to analyze has been carried out in the 
context of existing information from previous studies. At the same time, often this prior information 
cannot simply be treated as additional data in the statistical analysis. For instance, in the example 
above where $p_i$ is a p-value testing the association between SNP $i$ and some phenotype of interest,
we might have prior information available from earlier experiments that may have:

• studied a different but related disease; 

• tested a different population of patients; 

• used a different experimental protocol for genotyping the individuals; 

• defined disease status differently, or measured a real-valued phenotype differently; or 

• produced data that we believe may be unreliable. 

In any of these scenarios, the data from the previous study cannot simply be integrated with our
new experimental data without violating the integrity of the statistical analysis. However, this prior
information is extremely valuable and can give us some power to detect signals in a very
high-dimensional setting (e.g. $n$ in the millions or more, with very few true signals).

While the scenarios described above can correspond to many different forms of prior information
about the $n$ hypotheses being tested, in this work we focus on the specific problem of testing the
hypotheses when the only prior information is a ranking of the list. Before performing the experiment,
we use prior information to generate a ranked list of hypotheses $H_1, H_2, \dots, H_n$, where $H_1$ is the
hypothesis that we believe is most likely to correspond to a true signal, while $H_n$ is the one believed
to be least likely. After gathering new data, we then wish to test these hypotheses while taking this
ordering into account.

Application to high-dimensional regression. In addition to situations where prior information, or
data from related experiments, may provide a ranked list, this type of setting is also applicable to
other statistical problems. As a key example, consider the problem of inference for sparse regression,
where a response $y$ depends on some sparse subset of many possible features $X_1, \dots, X_p$. A sparse
linear model for this data states that $y = X\beta + \epsilon$, where $\beta \in \mathbb{R}^p$ is a sparse vector of coefficients
while $\epsilon$ contains the (zero-mean) noise in the response variable $y$. When the sample size $n$ is lower
than the number of features $p$, classical methods for performing inference on the coefficients of $\beta$,
such as testing hypotheses of the form $\beta_j = 0$ or finding confidence intervals for the coefficients $\beta_j$,
cannot be applied as the linear model is not identifiable (i.e. $X^\top X$ is rank-deficient). In the recent
literature, many approaches have been proposed for this inference problem, including a recent line of
work by Taylor et al. [37] that, when paired with the Lasso (penalized regression) or with a forward
stepwise selection procedure, calculates p-values for each feature in the order that they are selected.
This provides a list of p-values with an inherent ordering, and therefore is an example of the ordered
hypothesis testing problem we consider here.

Outline. The remainder of this paper is organized as follows. Section 2 formally introduces the
sequential testing problem considered here, gives background on several existing methods, and
develops our family of methods that generalizes these existing works. In this section we also
summarize some related problems and existing methods, and discuss our method in the context of the
surrounding literature. Section 3 presents the theoretical results of this paper, including results for
FDR control and for the power of the method in both finite-sample and asymptotic settings. These
theoretical results are proved in Section 4, with some proofs deferred to the appendix. Section 5
presents experiments on simulated data to validate our theoretical results and provide an empirical
comparison of various choices of the method within the general family. In Section 6 we adapt our
method to a dosage-response problem, where for a large set of genes, we would like to determine
which genes respond to a particular drug and, for the responsive genes, what is the minimal dosage
that induces a response; this section includes an experiment on real data demonstrating the
effectiveness of our method in this setting. We conclude with a discussion of our findings and future
directions in Section 7.

Code implementing the accumulation test methods in R [29], along with code to reproduce the
simulated data experiment and the gene dosage data experiment, is available online.¹


2 Problem and method 

We begin by formally defining the problem that we consider here. Let $\mathcal{H}_0 \subset \{1, \dots, n\}$ be a fixed set
(the “null hypotheses”), and let $p_1, \dots, p_n \in [0,1]$ be random variables, such that $p_i \sim \mathrm{Unif}[0,1]$
for all $i \in \mathcal{H}_0$, and furthermore the null p-values are independent of the non-null p-values. Our
method will construct a cutoff point $\widehat{k}$ that is adaptive to the data; formally, this cutoff is a function
mapping the observed p-values $(p_1, \dots, p_n)$ to some $\widehat{k} \in \{0, \dots, n\}$. This cutoff $\widehat{k}$ is the output
of our procedure, and should be interpreted as labeling the first $\widehat{k}$ hypotheses, i.e. $H_1, \dots, H_{\widehat{k}}$, as
“discoveries” (to use the terminology of hypothesis testing, we reject hypotheses $H_1, \dots, H_{\widehat{k}}$ and
do not reject $H_{\widehat{k}+1}, \dots, H_n$).

Ideally, we would like to choose $\widehat{k}$ so that the selected list $H_1, \dots, H_{\widehat{k}}$ contains only true signals and
the remaining hypotheses $H_{\widehat{k}+1}, \dots, H_n$ are all null. However, this may not be possible because our
initial ranking may be imperfect: the ranked list $H_1, \dots, H_n$ may contain signals and nulls
interspersed with each other, meaning that there is no threshold that perfectly separates the signals from
the nulls. The ranking is nonetheless informative if the signals are indeed concentrated
towards the top of the ranked list, and we select $\widehat{k}$ with the goal of detecting as many signals as
possible without too many false positives. To quantify this, we define the false discovery proportion
(FDP) cumulatively along the list:


\[ \mathrm{FDP}(k) = \frac{\mathrm{FalsePos}(k)}{k}, \]

where $\mathrm{FalsePos}(k) = \#\{i \le k : i \in \mathcal{H}_0\}$ is the number of false positives among the first $k$
hypotheses. In other words, $\mathrm{FDP}(k)$ gives the proportion of false positives (i.e. null hypotheses)
among the first $k$ hypotheses in the list, i.e. $H_1, \dots, H_k$. To agree with the definition of false
discovery rate used in the literature, we define $\mathrm{FDP}(0) := 0$ to cover the case that no discoveries
are made; more formally, we can write $\mathrm{FDP}(k) = \frac{\mathrm{FalsePos}(k)}{\max\{1, k\}}$ to cover both cases $k = 0$ and $k \ne 0$.

¹ http://www.stat.uchicago.edu/~rina/accumulationtests.html

For ease of notation, we will omit this more precise definition and will write $\frac{\mathrm{FalsePos}(k)}{k}$ with the
understanding that $\frac{0}{0}$ is treated as 0.
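
As a small illustration of this definition, the following Python sketch computes FalsePos(k) and FDP(k) along a ranked list. The function names and the representation of the null set as a Python set of 1-based indices are assumptions made for this example only; in practice the null set is unknown, so a computation like this is only available in simulations.

```python
def false_pos(null_set, k):
    """Number of null hypotheses among the first k (indices are 1-based)."""
    return sum(1 for i in range(1, k + 1) if i in null_set)


def fdp(null_set, k):
    """FDP(k) = FalsePos(k) / max(1, k), so that FDP(0) = 0."""
    return false_pos(null_set, k) / max(1, k)


# Hypothetical ranked list of 5 hypotheses in which H_2, H_4 and H_5 are null.
nulls = {2, 4, 5}
fdp_path = [fdp(nulls, k) for k in range(0, 6)]
```

In this toy list the first hypothesis is a true signal, so FDP(1) is zero, and the proportion of nulls grows as the cutoff moves down the list.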

Selecting a threshold $\widehat{k}$ involves a tradeoff: we would like a high $\widehat{k}$ to ensure that as many true signals
as possible are captured in the selected list $H_1, \dots, H_{\widehat{k}}$, but at the same time the proportion of false
positives will generally increase farther down the list, since the signals will be concentrated towards
the top of the ranking. In particular, we would like to bound the false discovery rate (FDR) [3],
defined as the expectation of the FDP (where the expectation is taken with respect to the distribution
of the p-values):

\[ \text{False discovery rate} = \mathbb{E}\left[\mathrm{FDP}(\widehat{k})\right] = \mathbb{E}\left[\frac{\mathrm{FalsePos}(\widehat{k})}{\widehat{k}}\right]. \]


A note on the ranking. In the setting described in the Introduction, where prior experience or
preexisting data allows us to rank the hypotheses from most to least likely, this ranking is reflected
in the indexing of the list. That is, before gathering data for the current experiment, we find the
hypothesis that we deem to be most likely and label it $H_1$, then the next most likely hypothesis is
labeled $H_2$, etc. Implicit in the assumption above is the idea that this ranking takes place before
the p-values are generated, that is, the p-values are independent from the process of ranking the
hypotheses. For instance, we cannot use a data set to rank the hypotheses and then use the same data
set to calculate p-values.


Existing methods. Suppose that we would like to select a cutoff $\widehat{k}$ that is as large as possible,
while bounding the FDR at some prespecified level $\alpha$ (e.g. $\alpha = 0.1$). Two approaches for the
ordered hypothesis testing problem have been proposed recently in the literature. First, G’Sell et al.
[22] propose the ForwardStop method:

\[ \widehat{k}_{\mathrm{ForwardStop}} = \max\left\{ k \in \{1, \dots, n\} : \frac{1}{k} \sum_{i=1}^{k} \log\left(\frac{1}{1 - p_i}\right) \le \alpha \right\}, \tag{1} \]


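with the convention that if this set is empty, we set $\widehat{k}_{\mathrm{ForwardStop}} = 0$ and make no rejections. To
understand this method, consider a single null hypothesis $H_i$. Since $p_i \sim \mathrm{Uniform}[0,1]$ by
assumption, it is true that $\log\left(\frac{1}{1 - p_i}\right) \sim \mathrm{Exp}(1)$ and in particular, $\mathbb{E}\left[\log\left(\frac{1}{1 - p_i}\right)\right] = 1$. Then, for any fixed
potential cutoff point $k$,

\[ \mathbb{E}\left[\sum_{i=1}^{k} \log\left(\frac{1}{1 - p_i}\right)\right] \ge \mathbb{E}\left[\sum_{i \le k,\, i \in \mathcal{H}_0} \log\left(\frac{1}{1 - p_i}\right)\right] = \mathrm{FalsePos}(k). \]

This implies that, for any fixed $k$, the scaled cumulative sum $\frac{1}{k} \sum_{i=1}^{k} \log\left(\frac{1}{1 - p_i}\right)$ is, in expectation, at least
as large as $\mathrm{FDP}(k)$. Therefore, the cutoff $\widehat{k}_{\mathrm{ForwardStop}}$ is the last time when the estimated FDP lies at
or below the predetermined level $\alpha$. G’Sell et al. [22, Theorem 1] prove that this procedure controls
FDR at the level $\alpha$, that is,

\[ \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_{\mathrm{ForwardStop}})\right] \le \alpha. \]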
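
The rule in (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' R implementation; the function name `forward_stop` and the example p-values are hypothetical, and a p-value of exactly 1 is treated here as contributing an infinite term (an assumption of this sketch).

```python
import math


def forward_stop(pvalues, alpha):
    """Largest k with (1/k) * sum_{i<=k} log(1/(1-p_i)) <= alpha, or 0 if none."""
    k_hat = 0
    running_sum = 0.0
    for k, p in enumerate(pvalues, start=1):
        # log(1/(1-p)) equals -log1p(-p); p = 1 would give +infinity.
        running_sum += math.inf if p >= 1.0 else -math.log1p(-p)
        if running_sum / k <= alpha:
            k_hat = k  # remember the last k at which the estimate is at or below alpha
    return k_hat


# Hypothetical ordered p-values: two strong signals, one large p-value, one more signal.
example = [0.01, 0.02, 0.9, 0.01]
```

Because the cutoff is the last index at which the estimate is at or below $\alpha$, a single large p-value does not necessarily end the search: with a more lenient level the rule can recover and include later hypotheses.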


A second existing method uses a similar summation to estimate the FDP, but with a discrete step
function rule rather than a continuous measure: given a parameter $C > 1$, the Sequential Step-up
Procedure (SeqStep) of Barber and Candes [1] sets

\[ \widehat{k}_{\mathrm{SeqStep}(C)} = \max\left\{ k \in \{1, \dots, n\} : \frac{1}{k} \sum_{i=1}^{k} C \cdot \mathbf{1}\{p_i > 1 - 1/C\} \le \alpha \right\}. \tag{2} \]



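As for the ForwardStop procedure [22], we see that this cumulative summation serves to (over)estimate
the FDP: for any null hypothesis $H_i$, clearly $\mathbb{E}\left[C \cdot \mathbf{1}\{p_i > 1 - 1/C\}\right] = 1$ and so

\[ \mathbb{E}\left[\sum_{i=1}^{k} C \cdot \mathbf{1}\{p_i > 1 - 1/C\}\right] \ge \mathbb{E}\left[\sum_{i \le k,\, i \in \mathcal{H}_0} C \cdot \mathbf{1}\{p_i > 1 - 1/C\}\right] = \mathrm{FalsePos}(k). \]
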


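Barber and Candes [1, Theorem 3] prove that this procedure controls a modified form of the FDR:

\[ \mathbb{E}\left[\frac{\mathrm{FalsePos}(\widehat{k}_{\mathrm{SeqStep}(C)})}{C/\alpha + \widehat{k}_{\mathrm{SeqStep}(C)}}\right] \le \alpha. \tag{3} \]

For exact FDR control, the same paper proposes a slightly more conservative variant, the SeqStep+
method [1], defined by

\[ \widehat{k}_{\mathrm{SeqStep+}(C)} = \max\left\{ k \in \{1, \dots, n\} : \frac{1}{1 + k} \left( C + \sum_{i=1}^{k} C \cdot \mathbf{1}\{p_i > 1 - 1/C\} \right) \le \alpha \right\}, \tag{4} \]

with the guarantee that for any $C > 1$,

\[ \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_{\mathrm{SeqStep+}(C)})\right] \le \alpha. \]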
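
The two step-function rules (2) and (4) differ only in the extra $C$ in the numerator and the extra 1 in the denominator. The sketch below makes that difference explicit; the function name `seqstep`, the `plus` flag and the example p-values are assumptions of this illustration, not names from the cited paper.

```python
def seqstep(pvalues, alpha, C, plus=False):
    """SeqStep cutoff as in (2), or the SeqStep+ cutoff as in (4) when plus=True."""
    k_hat = 0
    n_large = 0  # running count of p-values above the threshold 1 - 1/C
    for k, p in enumerate(pvalues, start=1):
        if p > 1.0 - 1.0 / C:
            n_large += 1
        if plus:
            estimate = (C + C * n_large) / (1 + k)  # more conservative estimate
        else:
            estimate = C * n_large / k
        if estimate <= alpha:
            k_hat = k
    return k_hat


# Hypothetical ordered p-values with a single value above 1 - 1/C when C = 2.
example = [0.1, 0.2, 0.3, 0.6, 0.1, 0.2, 0.1, 0.4]
```

On this short list with $C = 2$ and $\alpha = 0.25$, SeqStep rejects all eight hypotheses while SeqStep+ rejects none, which illustrates how much the added offset matters when $n$ is small.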


A general family of methods. Examining the two procedures described above, G’Sell et al. [22]’s
ForwardStop procedure and Barber and Candes [1]’s SeqStep procedures, we see that the two have
a common structure: they each compute an (over)estimate of the false discovery proportion at each
point in the path, $\mathrm{FDP}(k)$, via two different choices of cumulative sums. We now generalize these
two procedures with a broader family:

Definition 1 (Accumulation test for ordered hypotheses). Suppose we are given a ranked list of
$n$ hypotheses $H_1, \dots, H_n$ with corresponding p-values $p_1, \dots, p_n \in [0,1]$. Fix any function
$h : [0,1] \to [0, \infty)$ that satisfies $\int_{t=0}^{1} h(t)\, dt = 1$; we call $h$ the “accumulation function”. Define the
estimated FDP at each cutoff $k \in \{1, \dots, n\}$ as

\[ \widehat{\mathrm{FDP}}_h(k) = \frac{\sum_{i=1}^{k} h(p_i)}{k}, \]

and then select the adaptive cutoff

\[ \widehat{k}_h = \max\left\{ k \in \{1, \dots, n\} : \widehat{\mathrm{FDP}}_h(k) \le \alpha \right\}, \]

where $\alpha \in (0,1)$ is a prespecified target FDR level. (We use the convention that $\widehat{k}_h = 0$ if this set is
empty.) We then reject the hypotheses $H_1, \dots, H_{\widehat{k}_h}$ and do not reject $H_{\widehat{k}_h + 1}, \dots, H_n$.
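
Definition 1 translates directly into a generic routine that takes the accumulation function as an argument. The following Python sketch is one possible rendering under that reading; the names `accumulation_test`, `h_forward_stop` and `make_h_seqstep` are hypothetical, and the two accumulation functions shown are the ForwardStop and SeqStep choices described earlier.

```python
import math


def accumulation_test(pvalues, h, alpha):
    """Return the largest k with (1/k) * sum_{i<=k} h(p_i) <= alpha, or 0 if none."""
    k_hat = 0
    accumulated = 0.0
    for k, p in enumerate(pvalues, start=1):
        accumulated += h(p)
        if accumulated / k <= alpha:
            k_hat = k
    return k_hat


def h_forward_stop(t):
    """ForwardStop accumulation function log(1/(1-t)); infinite at t = 1."""
    return math.inf if t >= 1.0 else -math.log1p(-t)


def make_h_seqstep(C):
    """SeqStep accumulation function C * 1{t > 1 - 1/C} for a given C > 1."""
    return lambda t: float(C) if t > 1.0 - 1.0 / C else 0.0
```

Passing `h` as a function keeps the search over cutoffs in one place, so a new accumulation function only needs to satisfy the unit-integral requirement to be plugged in.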

Figure 1: Illustration of several choices of the accumulation function $h : [0,1] \to [0, \infty)$.


Note that our notation implicitly treats the desired FDR bound $\alpha$ as fixed, while $h$ appears in the
subscript to emphasize that we may choose between many candidate accumulation functions $h$. In
particular, choosing the accumulation function $h_{\mathrm{ForwardStop}}(t) = \log\left(\frac{1}{1 - t}\right)$ yields G’Sell et al.
[22]’s ForwardStop procedure, while $h_{\mathrm{SeqStep}}(t) = C \cdot \mathbf{1}\{t > 1 - 1/C\}$ yields Barber and Candes
[1]’s SeqStep procedure (for any parameter $C > 1$). We will also study an additional choice for the
accumulation function $h$, the “HingeExp” function,

\[ h_{\mathrm{HingeExp}}(t) = \begin{cases} C \cdot \log\left(\frac{1}{C(1 - t)}\right), & \text{for } t > 1 - 1/C, \\ 0, & \text{for } t \le 1 - 1/C. \end{cases} \tag{5} \]

This function combines the ideas of the step function used in SeqStep with the ForwardStop method;
Figure 1 gives a visual comparison of these methods. The name “HingeExp” arises from the “hinge
point” of the function at $t = 1 - 1/C$ (similar to the hinge loss function in machine learning),
combined with the observation that, for a null p-value $p_i \sim \mathrm{Unif}[0,1]$, we have $h_{\mathrm{HingeExp}}(p_i)$ distributed
as $C$ times an $\mathrm{Exp}(1)$ random variable with probability $1/C$ (and equal to zero otherwise). We
will see that the HingeExp method offers superior empirical performance compared with existing
options, in many settings.
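
To make the HingeExp function concrete, the sketch below evaluates it and numerically checks the unit-integral requirement on accumulation functions. The function names and the simple midpoint rule are choices made for this illustration; the value at $t = 1$ is taken to be infinite, which is an assumption of the sketch.

```python
import math


def h_hinge_exp(t, C):
    """HingeExp: C * log(1 / (C * (1 - t))) for t > 1 - 1/C, and 0 otherwise."""
    if t <= 1.0 - 1.0 / C:
        return 0.0
    if t >= 1.0:
        return math.inf
    # max(...) guards against tiny negative values from rounding near the hinge point.
    return max(0.0, C * math.log(1.0 / (C * (1.0 - t))))


def midpoint_integral(func, n_grid=100000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    width = 1.0 / n_grid
    return sum(func((j + 0.5) * width) for j in range(n_grid)) * width


# Approximate integral of the HingeExp function with C = 2; it should be close to 1.
approx_integral = midpoint_integral(lambda t: h_hinge_exp(t, 2.0))
```

The midpoint rule never evaluates the function at $t = 1$, so the logarithmic singularity there does not cause trouble, and the approximation agrees with 1 to within the grid resolution.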


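For any accumulation function $h$, as before the data-dependent cutoff $\widehat{k}_h$ should intuitively control
the false discovery rate at the desired level. This is because, for any fixed $k$, the sum $\sum_{i=1}^{k} h(p_i)$ is
the estimated “accumulation” of false positives by the time we have reached the $k$th position in the
list:

\[ \mathbb{E}\left[\sum_{i=1}^{k} h(p_i)\right] \ge \mathbb{E}\left[\sum_{i \le k,\, i \in \mathcal{H}_0} h(p_i)\right] = \mathrm{FalsePos}(k), \]

where the last step holds because $\mathbb{E}[h(p_i)] = 1$ for $i \in \mathcal{H}_0$ due to the requirement $\int_{t=0}^{1} h(t)\, dt = 1$.
Therefore, for each $k$, the estimated false discovery proportion $\widehat{\mathrm{FDP}}_h(k)$ is an overestimate of the
actual FDP:

\[ \mathbb{E}\left[\widehat{\mathrm{FDP}}_h(k)\right] = \mathbb{E}\left[\frac{\sum_{i=1}^{k} h(p_i)}{k}\right] \ge \frac{\mathrm{FalsePos}(k)}{k} = \mathrm{FDP}(k). \tag{6} \]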

Preview of theoretical results. Our first main result, Theorem 1, proves that, for any choice of
the accumulation function $h$, the method described above satisfies a bound on modified FDR similar
to the bound (3) obtained in [1], with no further assumptions other than the ones described above.
In the simplest case, for a desired false discovery rate bound $\alpha$, and for any bounded accumulation
function $h : [0,1] \to [0, C]$, we have

\[ \mathbb{E}\left[\frac{\mathrm{FalsePos}(\widehat{k}_h)}{C/\alpha + \widehat{k}_h}\right] \le \alpha. \]

This result is given in Theorem 1, along with generalizations to other formulations of the test and
to unbounded accumulation functions. Of course, the quantity whose expectation is bounded is
slightly different from the FDP due to the added term $C/\alpha$ in the denominator. Intuitively, this
should make little difference when $\widehat{k}_h$ is large, and indeed in Theorem 2 we prove that the
accumulation test controls the FDP asymptotically.

Since we are free to choose any accumulation function $h$, we naturally would like to know how
to choose this function to maximize our power for detecting signals. Intuitively, an accumulation
function $h$ will have good power if $\mathbb{E}[h(p_i)]$ is as low as possible for the non-null p-values, since
these expectations control the extent of the overestimation of FDP (see (6)). Our theoretical
results for this setting show that lower expectations $\mathbb{E}[h(p_i)]$ for $i \notin \mathcal{H}_0$ lead to higher power in
an asymptotic setting (Theorem 3). Minimizing $\mathbb{E}[h(p_i)]$ over the non-nulls $i \notin \mathcal{H}_0$ remains an
open question, although somewhat surprisingly, if we restrict our attention to bounded accumulation
functions only, we find that the step functions used in the SeqStep procedure [1] achieve the best
(lowest) expectations $\mathbb{E}[h(p_i)]$ for non-nulls $i \notin \mathcal{H}_0$ (Lemma 2).

2.1 Related work 

As discussed above, several existing methods treat the ordered hypothesis testing problem. Our
method generalizes the ForwardStop procedure by G’Sell et al. [22] and the SeqStep procedure by
Barber and Candes [1] for controlling FDR in the ordered setting. Another procedure known to us is
$\alpha$-investing by Foster and Stine [18], which controls a ratio of the expected values of $V$ and $R$
(where $V$ = number of false positives and $R$ = total number of discoveries), a criterion weaker than
the FDR. It allows users to incorporate prior knowledge such as an ordering to improve power.
However, $\alpha$-investing shows lower power than ForwardStop in simulations carried out by G’Sell et al. [22].

Ordered testing procedures have been shown to provide FDR control in regression models. Taylor 
et al. [37] derived post-selection hypothesis tests at each step of the forward stepwise and LARS 
procedures. These tests yield an ordered list of p-values, corresponding to a nested sequence of 
models. With the ordered testing procedures, these sequential p-values can be transformed into a 
model selection procedure with FDR guarantee ([22]), which we explore in Section 5. 

While research on ordered multiple testing is limited, there is a rich literature on general multiple
testing and FDR control. Here we discuss three widely used methods: the Benjamini-Hochberg (BH)
procedure [3], Storey [34]’s modification of the BH procedure, and the empirical Bayes method
(Efron et al. [17], Efron and Tibshirani [16]). The BH procedure for FDR control works for any
configuration of null and non-null hypotheses, given that the p-values corresponding to true nulls
are independent. In the dependent case, Benjamini and Yekutieli [5] showed that the BH method still
controls FDR if the joint distribution of statistics (or p-values) is PRDS on the set of true nulls (PRDS
stands for "positive regression dependence on a subset", e.g. Gaussian variables with a totally positive
covariance matrix are PRDS), or if the BH procedure is run at an FDR level that is smaller than the
target FDR level by a factor of order $\log(n)$, where $n$ is the total number of hypotheses. The BH
procedure has been widely applied to examine large scale datasets, including microarray gene
expression data ([30], [13]), brain fMRI ([20], [24]), etc. Storey [34]’s procedure replaces $n$ in the BH
procedure with an estimate of the total number of nulls, which may be substantially smaller than $n$.
Therefore, it is less conservative and has greater power than the BH procedure ([34], [35]). The
empirical Bayes method [17] calculates the Bayesian FDR as the posterior probability of null
conditioned on rejection, from the estimates of prior probabilities and densities of nulls and non-nulls.
Efron et al. [17] and Efron and Tibshirani [16] also proposed local FDR, the probability of rejecting a
null in a subset of the rejection region, and showed its application in the analysis of breast cancer
microarray data.

In multiple testing, when hypotheses share a hierarchical or group dependence structure, this
information can be utilized to improve the power of testing and the interpretability of results. The
hierarchical testing procedure by Benjamini and Yekutieli [6] incorporates known hierarchical
dependence between hypotheses by arranging them on a tree. At each step, hypotheses within the
same family (i.e. with the same immediate parents) are tested simultaneously, and for the
significant cases, the process keeps moving forward. This procedure can target the control of FDR at
full-tree level, at a certain depth, or among the outer-nodes. The Group BH procedure by Hu et al.
[26] groups the hypotheses based on known dependence information (such as genes within the same
biological pathways, or sharing the same phenotypes). The proportion of non-nulls in each group is
then estimated, and the original p-values are reweighted to emphasize the groups with higher
estimated proportions of non-nulls before the final BH step. Other studies or approaches that fall into
this category are the pooled and separate analysis ([14]), the decision theoretical approach (Cai and
Sun [8]), and the selection-adjusted procedure (Benjamini and Bogomolov [2]). In the brain fMRI
application, Heller et al. [24] grouped voxel units into clusters using previous correlation data, and
applied the BH method at the cluster level. Their approach enjoys greater interpretability, as well as
increased power.

Besides hierarchical and grouped testing, there are several approaches in the literature to incorporate
prior information into multiple testing. One approach is by weighting the hypotheses with prior
information ([4], [21], [31], [26]). Another approach, mostly applied in microarray analysis, uses
prior information to exclude non-informative genes before the final selection step of significant ones,
which enhances the power ([7], [23], [27], [36]). Also worth noting is the Bayesian mixture model
approach to include previous knowledge in genome-wide linkage studies and association studies
([28], [19]). Recently, Du and Zhang [12] introduced a single-index modulated (SIM) procedure,
which assumes the availability of a bivariate p-value $(p_1, p_2)$ (where $p_1$ is the p-value from prior
information, and $p_2$ is the main p-value reflecting current information), and projects it into a single
p-value combining $p_1$ and $p_2$ in some optimal direction for the final analysis. With prior information,
this approach could improve the power significantly while maintaining the control of FDR.

Scott and Berger [33] explored a Bayesian hierarchical approach for multiple testing. The posterior
probabilities, including the probability that hypothesis $i$ is null given the data, are inferred through
importance sampling. They discussed the choice of prior distributions on model parameters. This
approach has been applied in disease mapping [9], and abnormal corporate performance
identification [32].


3 Theoretical results 

In this section we develop and prove results on the FDR control properties, and the power, of the
family of accumulation test methods. In Section 3.1 we prove that accumulation tests control the
FDR (or modified FDR) in the finite sample and asymptotic settings. In Section 3.2 we examine the
power of the accumulation tests in finite sample and asymptotic settings.


3.1 False discovery rate control 

3.1.1 Finite sample setting 


We begin with concrete finite-sample results to bound the false discovery rate of the family of
accumulation tests. Recall that, when a threshold $k \in \{1, \dots, n\}$ is selected, we are interested in
bounding the false discovery proportion

\[ \mathrm{FDP}(k) = \frac{\mathrm{FalsePos}(k)}{k} = \frac{\#\{i \le k : i \in \mathcal{H}_0\}}{k} \]

(with the convention that $\mathrm{FDP}(0) = 0$). We will also consider a modified form of the FDP,
introduced in Barber and Candes [1] for the SeqStep method, given by

\[ \mathrm{mFDP}_c(k) = \frac{\mathrm{FalsePos}(k)}{c + k} = \frac{\#\{i \le k : i \in \mathcal{H}_0\}}{c + k}. \]

Of course, when $c$ is a constant while $k$ is large, the modified FDP is nearly identical to the FDP.
Our main result shows that the accumulation test controls the modified FDP. Furthermore, a slightly
more conservative test (defined in the theorem) controls the original FDP.
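
The effect of the offset $c$ is easy to see numerically. The sketch below compares the two quantities for a small and a large cutoff; the function names and the specific counts are hypothetical, and the offset $c = 20$ corresponds to, for example, $C = 2$ and $\alpha = 0.1$ in $c = C/\alpha$.

```python
def plain_fdp(false_pos, k):
    """FDP(k) = FalsePos(k) / max(1, k)."""
    return false_pos / max(1, k)


def modified_fdp(false_pos, k, c):
    """mFDP_c(k) = FalsePos(k) / (c + k), for an offset c > 0."""
    return false_pos / (c + k)


c = 20.0  # hypothetical offset, e.g. C / alpha with C = 2 and alpha = 0.1
small_k = (plain_fdp(2, 10), modified_fdp(2, 10, c))       # 20% nulls among 10 rejections
large_k = (plain_fdp(200, 1000), modified_fdp(200, 1000, c))  # 20% nulls among 1000 rejections
```

With ten rejections the modified FDP is a third of the FDP, while with a thousand rejections the two differ by less than half a percentage point, which is the sense in which the modification matters little when the number of rejections is large.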

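Theorem 1. Let $h : [0,1] \to [0, \infty)$ be any function with $\int_{t=0}^{1} h(t)\, dt = 1$, and let $\alpha \in (0,1)$ be
some prespecified target FDR level. Fix any $C > 0$. Define

\[ \widehat{k}_h = \max\left\{ k \in \{1, \dots, n\} : \frac{1}{k} \sum_{i=1}^{k} h(p_i) \le \alpha \right\}, \tag{7} \]

with the convention that $\widehat{k}_h = 0$ if this set is empty, and define

\[ \widehat{k}_h^{+} = \max\left\{ k \in \{1, \dots, n\} : \frac{1}{1 + k} \left( C + \sum_{i=1}^{k} h(p_i) \right) \le \alpha \right\}, \tag{8} \]

with the same convention. Then, in the special case that $\max_{0 \le t \le 1} h(t) \le C$, we have

\[ \mathbb{E}\left[\mathrm{mFDP}_{C/\alpha}(\widehat{k}_h)\right] \le \alpha \quad \text{and} \quad \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_h^{+})\right] \le \alpha. \]

In the general case, with no restriction on the range of $h$, we have

\[ \mathbb{E}\left[\mathrm{mFDP}_{C/\alpha}(\widehat{k}_h)\right] \le \frac{\alpha}{\int_{t=0}^{1} \left( h(t) \wedge C \right) dt} \tag{9} \]

and

\[ \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_h^{+})\right] \le \frac{\alpha}{\int_{t=0}^{1} \left( h(t) \wedge C \right) dt}, \tag{10} \]

where we use the notation $a \wedge b := \min\{a, b\}$.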
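
As a worked instance of the general bounds (9) and (10), consider the unbounded ForwardStop function $h(t) = \log\left(\frac{1}{1 - t}\right)$. The calculation below is an illustration added here, not a statement taken from the theorem; it uses the substitution $u = 1 - t$ and the fact that $\log(1/u) \ge C$ exactly when $u \le e^{-C}$.

```latex
\begin{align*}
\int_{t=0}^{1} \left( \log\tfrac{1}{1-t} \wedge C \right) dt
  &= \int_{u=0}^{1} \min\left\{ \log\tfrac{1}{u},\, C \right\} du \\
  &= C e^{-C} + \int_{u=e^{-C}}^{1} \log\tfrac{1}{u}\, du \\
  &= C e^{-C} + \Big[ u - u \log u \Big]_{u=e^{-C}}^{1} \\
  &= C e^{-C} + 1 - e^{-C} - C e^{-C}
   = 1 - e^{-C}.
\end{align*}
```

For this choice of $h$, the right-hand side of (9) and (10) is therefore $\alpha / (1 - e^{-C})$; taking $C = 3$, for example, inflates the target level by a factor of about 1.05.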


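We also give a result specifically for the HingeExp function, which gives a tighter bound than that
guaranteed by Theorem 1:

Lemma 1. Let $h(t) = C \cdot \log\left(\frac{1}{C(1 - t)}\right) \cdot \mathbf{1}\{t > 1 - 1/C\}$ be the HingeExp function with parameter $C$.
Then, under the same definitions and assumptions as Theorem 1,

\[ \mathbb{E}\left[\mathrm{mFDP}_{2C/\alpha}(\widehat{k}_h)\right] \le \alpha. \]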

Comparison to existing results. As discussed in Section 2, the accumulation test contains two
existing procedures as special cases: SeqStep [1] with the step function $h(t) = C \cdot \mathbf{1}\{t > 1 - 1/C\}$,
and ForwardStop [22] with $h(t) = \log\left(\frac{1}{1 - t}\right)$. Furthermore, the slightly altered accumulation
test given in (8) contains as a special case the SeqStep+ procedure [1], again with
$h(t) = C \cdot \mathbf{1}\{t > 1 - 1/C\}$. For the special cases of SeqStep and SeqStep+, our main result, Theorem 1,
obtains the same guarantee on modified and original FDP as proved in Barber and Candes [1,
Theorem 3]. For the special case of the ForwardStop procedure, the results obtained in Theorem 1 and
Lemma 1 are somewhat weaker than the guarantee given in G’Sell et al. [22, Theorem 1], which
proves that

\[ \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_h)\right] \le \alpha, \]

that is, a guarantee on the FDP rather than the modified FDP. As we will see in Theorem 2, however,
asymptotically the same result is obtained.


A conjecture. Based on the results of Theorem 1 and Lemma 1, we conjecture the following
generalization:

Conjecture 1. Let $h : [0,1] \to [0, \infty)$ be any accumulation function satisfying $\int_{t=0}^{1} h(t)\, dt = 1$ and
$\int_{t=0}^{1} [h(t)]^2\, dt \le C$, for some $C > 1$. Under the same definitions and assumptions as Theorem 1,

\[ \mathbb{E}\left[\mathrm{mFDP}_{C/\alpha}(\widehat{k}_h)\right] \le \alpha. \]

Note that, if true, this conjecture would replicate the results of Theorem 1 for bounded functions
and of Lemma 1 for the HingeExp function, and would strengthen the results of Theorem 1 for
unbounded functions.
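
The claim that the conjecture would recover both earlier results can be checked by computing the second moment $\int [h(t)]^2\, dt$ in each case. The short derivation below is added as an illustration; it uses only the unit-integral requirement, the substitution $u = C(1 - t)$, and the standard integral $\int_0^1 (\log u)^2\, du = 2$.

```latex
\begin{align*}
\text{Bounded case } (0 \le h \le C):\quad
  & \int_{t=0}^{1} [h(t)]^2\, dt \le C \int_{t=0}^{1} h(t)\, dt = C, \\[4pt]
\text{HingeExp with parameter } C:\quad
  & \int_{t=1-1/C}^{1} C^2 \left[ \log\tfrac{1}{C(1-t)} \right]^2 dt
    = C \int_{u=0}^{1} (\log u)^2\, du = 2C.
\end{align*}
```

So a bounded function satisfies the second-moment condition with the same constant $C$ as in Theorem 1, while the HingeExp function satisfies it with constant $2C$, matching the offset $2C/\alpha$ that appears in Lemma 1.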


3.1.2 Asymptotic setting 


In our finite-sample result (Theorem 1), we proved that the accumulation test controls a modified
form of the FDP, which as discussed earlier, is nearly equal to the original FDP as long as the number
of rejections $\widehat{k}_h$ is large. Next, we show that the accumulation test controls FDR asymptotically as
long as the number of rejections tends to infinity.

Theorem 2. Let $h : [0,1] \to [0, \infty)$ be any function with $\int_{t=0}^{1} h(t)\, dt = 1$, and let $\alpha \in (0,1)$ be
some prespecified target FDR level. Consider a sequence of ordered hypothesis testing problems,
with $n = 1, 2, \dots$. Suppose that there exists a sequence $m_n \in \mathbb{N}$ with $m_n \to \infty$ such that

\[ \mathbb{P}\left\{ \widehat{k}_h < m_n \right\} \to 0 \quad \text{as } n \to \infty. \]

Then

\[ \lim_{n \to \infty} \mathbb{E}\left[\mathrm{FDP}(\widehat{k}_h)\right] \le \alpha. \]


3.2 Power 

Up to this point, our discussion and theoretical analysis of accumulation tests has focused on
controlling the false discovery rate. Of course, in practice we are interested in balancing the goals of
reducing false positives (Type I error) while increasing the number of true discoveries (power).
Formally, we will define the power of a cutoff $k \in \{1, \dots, n\}$ as the proportion of non-nulls that are
discovered by the test,

\[ \mathrm{Power}(k) = \frac{\#\{i \le k : i \notin \mathcal{H}_0\}}{\#\{i : i \notin \mathcal{H}_0\}}, \]

and will aim to choose the accumulation function $h$ to maximize the expected power of the
accumulation test,

\[ \mathbb{E}\left[\mathrm{Power}(\widehat{k}_h)\right]. \]
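
For simulations where the null set is known, this definition of power can be computed directly. The sketch below is a minimal illustration; the function name, the 1-based index convention, and the choice to return 0 when there are no non-nulls are assumptions of this example.

```python
def power(null_set, n, k):
    """Power(k) = #{i <= k : i not in null_set} / #{i : i not in null_set}."""
    non_nulls = [i for i in range(1, n + 1) if i not in null_set]
    if not non_nulls:
        return 0.0  # convention assumed here when there are no true signals
    discovered = sum(1 for i in non_nulls if i <= k)
    return discovered / len(non_nulls)


# Hypothetical list of 5 hypotheses in which H_1 and H_3 are the true signals.
nulls = {2, 4, 5}
power_path = [power(nulls, 5, k) for k in range(0, 6)]
```

In this toy list the power reaches 1 at $k = 3$, so extending the cutoff further can only add false positives.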


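Recall that $\widehat{k}_h$ is defined as the largest $k$ such that

\[ \widehat{\mathrm{FDP}}_h(k) = \frac{\sum_{i=1}^{k} h(p_i)}{k} \le \alpha. \]

In other words, high power corresponds to a large value of $\widehat{k}_h$, which is possible only when the
estimates $\widehat{\mathrm{FDP}}_h(k)$ are low. Therefore, a good choice of the accumulation function $h$ is one that
allows for low estimates $\widehat{\mathrm{FDP}}_h(k)$ of the false discovery proportion along the sequence of hypotheses.
Consider the expectation of this estimate at any fixed $k$:

\[ \mathbb{E}\left[\widehat{\mathrm{FDP}}_h(k)\right] = \frac{\sum_{i=1}^{k} \mathbb{E}[h(p_i)]}{k} = \frac{\sum_{i \le k,\, i \in \mathcal{H}_0} \mathbb{E}[h(p_i)]}{k} + \frac{\sum_{i \le k,\, i \notin \mathcal{H}_0} \mathbb{E}[h(p_i)]}{k} = \mathrm{FDP}(k) + \frac{\sum_{i \le k,\, i \notin \mathcal{H}_0} \mathbb{E}[h(p_i)]}{k}, \]

where as before the last step holds because we require $\int_{t=0}^{1} h(t)\, dt = 1$ and so $\mathbb{E}[h(p_i)] = 1$ for
$i \in \mathcal{H}_0$. Therefore, for a given choice of $h$, we see that $\widehat{\mathrm{FDP}}_h(k)$ is an overestimate of $\mathrm{FDP}(k)$, with
bias given by

\[ \frac{\sum_{i \le k,\, i \notin \mathcal{H}_0} \mathbb{E}[h(p_i)]}{k}. \]

Since power will increase if the estimated false discovery proportion $\widehat{\mathrm{FDP}}_h(k)$ is small, we see that a
good choice of accumulation function $h$ is one that minimizes $\mathbb{E}[h(p_i)]$ for non-nulls $i \notin \mathcal{H}_0$ (while,
of course, satisfying the requirement $\mathbb{E}[h(p_i)] = 1$ for $i \in \mathcal{H}_0$).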
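
To see how different accumulation functions compare on this criterion, one can evaluate $\mathbb{E}[h(p)]$ numerically for a given non-null distribution. The sketch below uses the identity $\mathbb{E}[h(p)] = \int_0^1 h(F^{-1}(u))\, du$, where $F^{-1}$ is the quantile function of $p$. The alternative used here, $p = U^2$ with $U$ uniform, is purely hypothetical and chosen only because it concentrates p-values near zero; the function names are likewise assumptions of this example.

```python
import math


def expected_h(h, quantile, n_grid=20000):
    """Midpoint-rule approximation of E[h(p)] = integral over [0,1] of h(quantile(u))."""
    width = 1.0 / n_grid
    return sum(h(quantile((j + 0.5) * width)) for j in range(n_grid)) * width


def h_forward_stop(t):
    return math.inf if t >= 1.0 else -math.log1p(-t)


def h_seqstep(t, C=2.0):
    return C if t > 1.0 - 1.0 / C else 0.0


def null_quantile(u):
    return u  # uniform null p-values


def alt_quantile(u):
    return u * u  # hypothetical alternative: p = U^2, stochastically smaller than uniform


null_fs = expected_h(h_forward_stop, null_quantile)
alt_fs = expected_h(h_forward_stop, alt_quantile)
alt_ss = expected_h(h_seqstep, alt_quantile)
```

Under this particular alternative the exact values are $2 - 2\log 2 \approx 0.614$ for ForwardStop and $2(1 - \sqrt{1/2}) \approx 0.586$ for SeqStep with $C = 2$, both below the null expectation of 1, so each non-null contributes less to the accumulated sum than a null would.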

In the following section, we will examine how $\mathbb{E}[h(p_i)]$ affects the power by studying an asymptotic
scenario. We will find that the power of any accumulation function $h$ can be characterized exactly by
its expected value for non-null p-values (Theorem 3). If all the non-null p-values follow a single
distribution, $p_i \sim \mathcal{D}$ for $i \notin \mathcal{H}_0$, we can think of these results as characterizing the power of an
accumulation function $h$ for testing the null hypothesis given by the uniform distribution against the
alternate hypothesis given by the distribution $\mathcal{D}$.

Of course, in practice, we will not always know the distribution of the non-null p-values. In
Section 3.2.2 we discuss the problem of determining a good choice of accumulation function $h$ without
prior knowledge of an alternate distribution.

Before we proceed, we recall the definition of a subexponential random variable:

\[
X \text{ is subexponential with parameters } (\sigma^2, b) \text{ if }
\mathbb{E}\left[e^{\theta (X - \mathbb{E}[X])}\right] \le e^{\sigma^2 \theta^2 / 2} \text{ for all } |\theta| \le \frac{1}{b}. \tag{11}
\]

Note that a $\sigma^2$-subgaussian random variable, such as a $N(0, \sigma^2)$ variable, is $(\sigma^2, 0)$-subexponential
trivially, and so the subexponential condition is weaker than the subgaussian condition.

We will say that an accumulation function $h$ is $(\sigma^2, b)$-subexponential with respect to the p-values if
$h(p_i)$ is $(\sigma^2, b)$-subexponential for each $i = 1, \dots, n$. In particular, note that the SeqStep,
ForwardStop, and HingeExp functions are all subexponential (as long as the p-values are all dominated by
the uniform distribution).
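As a worked illustration of the subexponential condition, consider ForwardStop applied to a uniform null p-value. The form $h(t) = \log(1/(1-t))$ is assumed here from the method's usual definition, and the constants $(2, 2)$ below are one valid choice derived for this example, not values taken from the paper.

```latex
% If p ~ Unif[0,1], then Y = h(p) = log(1/(1-p)) is Exp(1), with E[Y] = 1 and,
% for theta < 1,
\mathbb{E}\left[e^{\theta (Y - 1)}\right] = \frac{e^{-\theta}}{1 - \theta},
\qquad\text{so}\qquad
\log \mathbb{E}\left[e^{\theta (Y - 1)}\right] = -\theta - \log(1 - \theta) = \sum_{j \ge 2} \frac{\theta^j}{j}.
% For |theta| <= 1/2, bound the series by a geometric one:
\sum_{j \ge 2} \frac{\theta^j}{j} \le \frac{\theta^2}{2} \sum_{j \ge 0} |\theta|^j
= \frac{\theta^2}{2 (1 - |\theta|)} \le \theta^2 = \frac{2\,\theta^2}{2}.
% Hence Y satisfies the subexponential condition with (sigma^2, b) = (2, 2).
% The moment generating function is infinite for theta >= 1, so Y is not subgaussian.
```

This shows concretely why the weaker subexponential condition is needed: an unbounded accumulation function can have exponential tails even under the null.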


3.2.1 Power calculation in an asymptotic setting 


In this section, we show that we can calculate the asymptotic power of the accumulation methods,
and compare between them, under the assumption that the proportion of non-nulls along the list is
converging to a fixed function $f(\cdot)$, where $f(\cdot)$ must satisfy some mild conditions. Specifically, we
consider an asymptotic scenario where

\[
\max_{k = 1, \dots, n} \left| \frac{\#\{i \le k : i \notin \mathcal{H}_0\}}{k} - f\left(\frac{k}{n}\right) \right| \le \epsilon_n \tag{12}
\]

and $\epsilon_n \to 0$ as $n \to \infty$. We assume that $f : [0,1] \to [0,1]$ is differentiable and satisfies, for some
constant $\delta > 0$ and for the fixed target FDR level $\alpha$,

\[
\begin{cases}
f'(t) \le 0 \text{ for all } t, \\
f'(t) \le -\delta \text{ for all } t \text{ such that } f(t) > 1 - \alpha, \\
t \mapsto t \cdot f(t) \text{ is a nondecreasing function.}
\end{cases} \tag{13}
\]

In words, these conditions require that the proportion of true signals decreases as we move along the
list; that the proportion decreases at a rate bounded away from zero during the initial portion of the
list; and that the number of true signals must of course increase as we move along the list.

Then we have the following theorem: 

Theorem 3. Fix any $\sigma^2 > 0$ and $b > 0$, any target FDR level $\alpha \in [0,1]$, and any $\mu \in (0,1)$.
Suppose that:

• The p-values are independent, with $p_i \sim \mathrm{Unif}[0,1]$ for all $i \in \mathcal{H}_0$ and $p_i \sim D$ for all $i \notin \mathcal{H}_0$,
for some distribution $D$;

• The function $h : [0,1] \to [0,\infty)$ satisfies $\mathbb{E}_{p_i \sim \mathrm{Unif}[0,1]}[h(p_i)] = 1$ and $\mathbb{E}_{p_i \sim D}[h(p_i)] = \mu$,
and is $(\sigma^2, b)$-subexponential with respect to the p-values; and

• The function $f : [0,1] \to [0,1]$ satisfies assumptions (12) and (13).

Define

\[
T = \begin{cases}
0, & \text{if } f(0) \le \frac{1-\alpha}{1-\mu}, \\
f^{-1}\left(\frac{1-\alpha}{1-\mu}\right) \in (0,1), & \text{if } f(1) < \frac{1-\alpha}{1-\mu} < f(0), \\
1, & \text{if } f(1) \ge \frac{1-\alpha}{1-\mu}.
\end{cases}
\]

Then the power converges as

\[
\mathrm{Power}(\hat{k}_h) = \frac{\left|\{1, \dots, \hat{k}_h\} \setminus \mathcal{H}_0\right|}{N_1(n)} \to \frac{T \cdot f(T)}{f(1)},
\]

where specifically this denotes convergence in probability as $n \to \infty$.

Remark 1. Note that, when holding $\alpha$ and $f(\cdot)$ fixed, $T$ is a nonincreasing function of $\mu = \mathbb{E}_{p_i \sim D}[h(p_i)]$.
Therefore, the (asymptotic) power is a nonincreasing function of $\mu$ as well, since $T \mapsto T \cdot f(T)$ is
nondecreasing. This means that if $\mathbb{E}[h_0(p_i)] \le \mathbb{E}[h(p_i)]$ for $i \notin \mathcal{H}_0$, the accumulation test using $h_0(\cdot)$
is asymptotically at least as powerful as that using $h(\cdot)$.
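To make the limit in Theorem 3 concrete, the sketch below computes $T$ and the limiting power $T \cdot f(T) / f(1)$ by bisection. It is a hypothetical illustration: the function names are invented, the example $f(t) = 1 - t/2$ is chosen only because it satisfies (13), and assigning $T = 0$ or $T = 1$ exactly at the boundary cases is an assumption of this sketch.

```python
from typing import Callable


def limiting_cutoff(f: Callable[[float], float], alpha: float, mu: float,
                    tol: float = 1e-12) -> float:
    """T from Theorem 3, for a continuous nonincreasing f on [0, 1]."""
    target = (1.0 - alpha) / (1.0 - mu)
    if target >= f(0.0):
        return 0.0          # estimated FDP exceeds alpha from the start
    if target <= f(1.0):
        return 1.0          # estimated FDP stays below alpha along the whole list
    lo, hi = 0.0, 1.0       # invariant: f(lo) > target >= f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def asymptotic_power(f: Callable[[float], float], alpha: float, mu: float) -> float:
    """Limiting power T * f(T) / f(1)."""
    T = limiting_cutoff(f, alpha, mu)
    return T * f(T) / f(1.0)


def f_example(t: float) -> float:
    """Hypothetical non-null proportion: f' = -1/2 and t * f(t) is nondecreasing."""
    return 1.0 - 0.5 * t
```

With `alpha = 0.2` and `mu = 0.1` this gives $T = 2/9$ and a limiting power of $32/81$; increasing `mu` (a less favorable accumulation function) can only lower both, as Remark 1 states.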

3.2.2 Choosing the accumulation function 

While our earlier results prove control of the (modified) FDR for a broad class of accumulation
functions, without any knowledge about the presence or absence of true signals, our understanding
of the power of these methods does depend on the “alternate hypothesis”, i.e. the distribution of
the non-null p-values. In particular, Theorem 3 shows that the power of a specific accumulation
function $h$ can be viewed, asymptotically, as a simple function of this alternate hypothesis. However,
without knowledge of this alternate hypothesis, how can we choose an effective $h$?

Here we suggest two partial answers to this question. First, observe that our FDR results, Theorems 1
and 2, rely only on the exchangeability of the null p-values $\{p_i : i \in \mathcal{H}_0\}$, rather than requiring a
strict i.i.d. assumption. Therefore, we can observe the unordered set of p-values, $\{p_i : i = 1, \dots, n\}$,
before choosing the accumulation function $h$, without negating the FDR control properties of the
method. In particular, we may then use an empirical Bayes method (e.g. see Efron [15]) to estimate
the distribution of the non-null $p_i$'s, which can then guide us in selecting $h$.

Quite unexpectedly, in one special case it is possible to choose an optimal $h$ without any knowledge
of the alternate distribution: the case of bounded accumulation functions. The following result
shows that the step function $h(t) = C \cdot 1\{t > 1 - 1/C\}$, which is used in the SeqStep method of [1], is
the optimal $C$-bounded accumulation function under a very mild assumption on the distribution of
non-null p-values:

\[
\text{The non-null p-value } p_i \text{ has a density } f_i : [0,1] \to [0,\infty), \text{ where } f_i \text{ is a nonincreasing function.} \tag{14}
\]

We say that $p_i$ satisfies the assumption (14) strictly if its density $f_i$ is a strictly decreasing function.

Lemma 2. Consider any accumulation function $h$ bounded by $C > 1$, that is, $h : [0,1] \to [0, C]$
satisfies $\int_{t=0}^1 h(t)\,dt = 1$. Let $h_0$ be the step function with the same bound,
$h_0(t) = C \cdot 1\{t > 1 - 1/C\}$. Suppose that a non-null p-value $p_i$ satisfies the assumption (14). Then

\[
\mathbb{E}[h(p_i)] \ge \mathbb{E}[h_0(p_i)].
\]

Furthermore, the inequality is strict whenever $p_i$ satisfies (14) strictly, unless $h(t) = h_0(t)$ almost
everywhere on $t \in [0,1]$.

In other words, based on the discussion above, we expect the step function (the SeqStep method) to 
offer more power than any other accumulation function that maps to the same range, as long as the 
non-null p-values satisfy the assumption (14). 
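A quick numerical check of this ordering, as a sketch under stated assumptions: the competing function $h(t) = 2t$ and the non-null density $2(1 - t)$ are hypothetical choices made only because they satisfy the conditions of Lemma 2 with $C = 2$; the integration helper is likewise illustrative.

```python
from typing import Callable

C = 2.0


def expect(h: Callable[[float], float], density: Callable[[float], float],
           grid: int = 100_000) -> float:
    """Midpoint-rule approximation of E[h(p)] = integral of h(t) * density(t) on [0, 1]."""
    step = 1.0 / grid
    return sum(h((j + 0.5) * step) * density((j + 0.5) * step) for j in range(grid)) * step


def h_step(t: float) -> float:
    """SeqStep step function C * 1{t > 1 - 1/C}."""
    return C if t > 1.0 - 1.0 / C else 0.0


def h_linear(t: float) -> float:
    """A different C-bounded accumulation function: values in [0, 2], integral 1."""
    return 2.0 * t


def density(t: float) -> float:
    """A strictly decreasing non-null density on [0, 1], as in assumption (14)."""
    return 2.0 * (1.0 - t)


e_step = expect(h_step, density)      # exact value 1/2
e_linear = expect(h_linear, density)  # exact value 2/3
```

Here the step function attains the smaller non-null expectation (1/2 against 2/3), which by Remark 1 corresponds to the higher asymptotic power.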

The assumption (14) is very natural, because we expect non-null p-values to give evidence against
the null hypothesis, i.e. the distribution of a non-null $p_i$ should place more mass on low values (near
0) than on high values (near 1). As a specific example, suppose that we are performing a z-test
on statistics $Z_i \sim N(\mu_i, 1)$, where the $i$th null hypothesis is that $\mu_i = 0$. We therefore compute
p-values with a two-tailed z-test, $p_i = 2(1 - \Phi(|Z_i|))$, where $\Phi$ is the standard normal cumulative
distribution function. Then for any non-null $i$, with $Z_i \sim N(\mu_i, 1)$ for $\mu_i \ne 0$, $p_i$ follows the density

\[
f_i(t) = \frac{e^{\mu_i \Phi^{-1}(1 - t/2)} + e^{-\mu_i \Phi^{-1}(1 - t/2)}}{2 e^{\mu_i^2 / 2}},
\]

which is a strictly decreasing function of $t$ due to the fact that $x \mapsto e^x + e^{-x}$ is a strictly increasing
function of $x$ when $x > 0$.
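The density in this example can be checked numerically. The sketch below is illustrative only (function names are hypothetical): it evaluates the density via the change of variables $z_t = \Phi^{-1}(1 - t/2)$ and, separately, the distribution function $P(p_i \le t) = P(|Z_i| \ge z_t)$, so that one can be compared with a finite difference of the other.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def pvalue_density(t: float, mu: float) -> float:
    """Density at t in (0, 1] of p = 2 * (1 - Phi(|Z|)) when Z ~ N(mu, 1)."""
    z = _STD.inv_cdf(1.0 - t / 2.0)
    return (math.exp(mu * z) + math.exp(-mu * z)) / (2.0 * math.exp(mu * mu / 2.0))


def pvalue_cdf(t: float, mu: float) -> float:
    """P(p <= t) = P(Z >= z_t) + P(Z <= -z_t) for Z ~ N(mu, 1)."""
    z = _STD.inv_cdf(1.0 - t / 2.0)
    return (1.0 - _STD.cdf(z - mu)) + _STD.cdf(-z - mu)
```

Setting `mu = 0` recovers the uniform density, and for any nonzero `mu` the density is strictly decreasing in `t`, so assumption (14) holds strictly in this example.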

We expect that similar results may be possible under weaker assumptions on the accumulation
function $h$ (e.g. an assumption of bounded moments or subexponential tails), and leave this question to
future work.


4 Proofs 

In this section, we give the proofs of some of our main results: Theorems 1 and 2 for FDR control, 
and Theorem 3 for the power, with some details deferred to the Appendix. The proofs of Lemmas 1 
and 2 are given in Appendix A.5 and Appendix A.3, respectively. 


4.1 Proof of Theorem 1 (finite-sample FDR control) 

The key ingredient for the proof of Theorem 1 is the following lemma: 
Lemma 3. Let $a_1, \dots, a_n \ge 0$ be any fixed thresholds, and let

\[
\hat{k} = \max\left\{ k \in \{1, \dots, n\} : \sum_{i=1}^k h(p_i) \le a_k \right\}, \tag{15}
\]

with the convention that $\hat{k} = 0$ if this set is empty. Then

\[
\mathbb{E}\left[ \frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}} h(p_i)} \right]
\le \frac{1}{\int_{t=0}^1 h(t) \wedge C \, dt}.
\]

To understand the role of this result in proving Theorem 1, first note that the definitions of $\hat{k}_h$ and
$\hat{k}_h^{+C}$, given in (7) and (8), can each be rewritten as a threshold criterion of the form (15) as given in
Lemma 3.

Essentially, Lemma 3 shows that, at $k = \hat{k}_h$ (or at $k = \hat{k}_h^{+C}$), we have
$\sum_{i=1}^k h(p_i) \approx \#\{i \le k : i \in \mathcal{H}_0\}$, and thus, this result guarantees that the estimated FDP,
$\widehat{\mathrm{FDP}}_h(k) = \frac{\sum_{i=1}^k h(p_i)}{k}$, is a reliable (over)estimate of the actual FDP,
$\mathrm{FDP}(k) = \frac{\#\{i \le k : i \in \mathcal{H}_0\}}{k}$. Given this lemma, the proof
of the bounds in Theorem 1 follows the arguments in Barber and Candès [1, Theorem 3]; we give
details in Appendix A.1.

In order to prove Lemma 3, we use a result that treats the Bernoulli case specifically: 

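Lemma 4 (Adapted from [1, Lemma 4]). Let $B_1, \dots, B_n \in \{0,1\}$ be independent, with
$B_i \sim \mathrm{Bernoulli}(\rho)$ for all $i \in \mathcal{H}_0$. Let $\{\mathcal{F}_k\}_{k=1,\dots,n}$ be any filtration in reverse time
(i.e. $\mathcal{F}_k \supseteq \mathcal{F}_{k+1}$) such that

\begin{align}
& B_i \in \mathcal{F}_k \text{ for all } i \notin \mathcal{H}_0, \text{ and for all } i > k \text{ with } i \in \mathcal{H}_0, \tag{16} \\
& \sum_{i \le k,\, i \in \mathcal{H}_0} B_i \in \mathcal{F}_k, \text{ and} \tag{17} \\
& \{B_i : i \le k,\, i \in \mathcal{H}_0\} \text{ are exchangeable with respect to } \mathcal{F}_k, \tag{18}
\end{align}

for all $k = 1, \dots, n$. Then

\[
M_k = \frac{1 + \#\{i \le k : i \in \mathcal{H}_0\}}{1 + \sum_{i \le k,\, i \in \mathcal{H}_0} B_i}
\]

is a supermartingale with respect to $\{\mathcal{F}_k\}$, and $\mathbb{E}[M_n] \le \frac{1}{\rho}$.

Equipped with this result for the special case of Bernoulli variables, we turn to the proof of our key
lemma, where we construct a coupling between the general case and the Bernoulli case.

Proof of Lemma 3. First, recall that the null p-values, $p_i$ for $i \in \mathcal{H}_0$, are i.i.d. draws from $\mathrm{Unif}[0,1]$
and are independent from the non-null p-values. Therefore, we can treat the non-null p-values, $p_i$
for $i \notin \mathcal{H}_0$, as fixed (by conditioning on their values). Define additional variables $V_i \sim \mathrm{Unif}[0,1]$ i.i.d.,
independent from the $p_i$'s. Define also

\[
B_i = 1\{V_i \le h(p_i)/C\}.
\]

Write $p_{1:n}$ to denote $p_1, \dots, p_n$. Note that, conditioning on $p_{1:n}$, we then have that the $B_i$'s are
independent, with distributions

\[
(B_i \mid p_{1:n}) \sim \mathrm{Bernoulli}\left(\frac{h(p_i) \wedge C}{C}\right). \tag{19}
\]

Furthermore, marginally, we see that for all $i \in \mathcal{H}_0$, the $B_i$'s are independent Bernoulli variables
with

\[
\mathbb{E}[B_i] = \mathbb{E}\left[\frac{h(p_i) \wedge C}{C}\right] = \frac{1}{C} \int_{t=0}^1 h(t) \wedge C \, dt =: \rho.
\]

Next, we would like to apply Lemma 4 to bound
$\mathbb{E}\left[\frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i}\right]$. First, we need to construct a
filtration $\{\mathcal{F}_k\}_{k=1,\dots,n}$ that satisfies the conditions of this lemma.

Let $\mathcal{F}_k$ be the $\sigma$-algebra defined by knowing $(p_i, V_i)$ for all $i \notin \mathcal{H}_0$, knowing $(p_i, V_i)$ for all $i > k$
with $i \in \mathcal{H}_0$, and knowing $\{(p_i, V_i) : i \le k,\, i \in \mathcal{H}_0\}$ (note that this is an unordered set: for
instance, if $1 \in \mathcal{H}_0$ and $2 \in \mathcal{H}_0$, then $\mathcal{F}_2$ knows $(p_1, V_1)$ and $(p_2, V_2)$ but does not know which one
is which). For each $i$, $B_i$ is a function of $(p_i, V_i)$. Furthermore, the $(p_i, V_i)$'s are i.i.d. for $i \in \mathcal{H}_0$.
Therefore, this filtration satisfies the conditions (16), (17), and (18) of Lemma 4. Furthermore, $\hat{k}$
as defined in (15) is a stopping time with respect to $\{\mathcal{F}_k\}$. Combining Lemma 4 with the Optional
Stopping Theorem, therefore,

\[
\mathbb{E}\left[\frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i}\right]
= \mathbb{E}\left[M_{\hat{k}}\right] \le \mathbb{E}[M_n] \le \frac{1}{\rho}. \tag{20}
\]

Next, we calculate

\begin{align*}
\mathbb{E}\left[\frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i}\right]
&= \mathbb{E}\left[\mathbb{E}\left[\frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i} \,\middle|\, p_{1:n}\right]\right]
&& \text{by the tower rule of expectations} \\
&= \mathbb{E}\left[\left(1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}\right) \cdot \mathbb{E}\left[\frac{1}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i} \,\middle|\, p_{1:n}\right]\right]
&& \text{since } \hat{k} \text{ is a function of } p_{1:n} \\
&\ge \mathbb{E}\left[\left(1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}\right) \cdot \frac{1}{1 + \mathbb{E}\left[\sum_{i \le \hat{k},\, i \in \mathcal{H}_0} B_i \,\middle|\, p_{1:n}\right]}\right]
&& \text{by Jensen's inequality} \\
&= \mathbb{E}\left[\left(1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}\right) \cdot \frac{1}{1 + \sum_{i \le \hat{k},\, i \in \mathcal{H}_0} \frac{h(p_i) \wedge C}{C}}\right] \\
&\ge C \cdot \mathbb{E}\left[\frac{1 + \#\{i \le \hat{k} : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}} h(p_i)}\right].
\end{align*}

Combining this result with (20), and recalling that $\frac{1}{\rho} = \frac{C}{\int_{t=0}^1 h(t) \wedge C \, dt}$, we have proved the lemma.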


4.2 Proof of Theorem 2 (asymptotic FDR control) 

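Proof of Theorem 2. Take any sequence $C_n > 0$ with $C_n \to \infty$ and $\frac{C_n}{m_n} \to 0$ (for instance, we may
take $C_n = \sqrt{m_n}$). Then note that

\[
\lim_{n \to \infty} \int_{t=0}^1 h(t) \wedge C_n \, dt = 1
\]

because we know that $\int_{t=0}^1 h(t)\,dt = 1$ by assumption. We then have

\begin{align*}
\mathbb{E}\left[\mathrm{FDP}(\hat{k}_h)\right]
&= \mathbb{E}\left[\mathrm{FDP}(\hat{k}_h) \cdot 1\{\hat{k}_h < m_n\}\right] + \mathbb{E}\left[\mathrm{FDP}(\hat{k}_h) \cdot 1\{\hat{k}_h \ge m_n\}\right] \\
&\le \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{\hat{k}_h} \cdot 1\{\hat{k}_h \ge m_n\}\right] \\
&\le \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C_n/\alpha + \hat{k}_h}\right] \cdot \frac{C_n/\alpha + m_n}{m_n} \\
&= \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \mathbb{E}\left[\mathrm{mFDP}_{C_n/\alpha}(\hat{k}_h)\right] \cdot \frac{C_n/\alpha + m_n}{m_n} \\
&\le \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \frac{\alpha}{\int_{t=0}^1 h(t) \wedge C_n \, dt} \cdot \frac{C_n/\alpha + m_n}{m_n}
\quad \text{by Theorem 1.}
\end{align*}

Taking limits,

\begin{align*}
\lim_{n \to \infty} \mathbb{E}\left[\mathrm{FDP}(\hat{k}_h)\right]
&\le \lim_{n \to \infty} \left( \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \frac{\alpha}{\int_{t=0}^1 h(t) \wedge C_n \, dt} \cdot \frac{C_n/\alpha + m_n}{m_n} \right) \\
&= \lim_{n \to \infty} \mathbb{P}\left\{0 < \hat{k}_h < m_n\right\} + \frac{\alpha}{\lim_{n \to \infty} \int_{t=0}^1 h(t) \wedge C_n \, dt} \cdot \lim_{n \to \infty} \frac{C_n/\alpha + m_n}{m_n} \\
&= 0 + \frac{\alpha}{1} \cdot 1 = \alpha.
\end{align*}

□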
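The step in the proof above that trades the denominator $\hat{k}_h$ for $C_n/\alpha + \hat{k}_h$ can be spelled out as follows. This is a short worked derivation added for clarity, using only the quantities already defined in the proof.

```latex
% On the event \hat{k}_h \ge m_n:
\frac{C_n/\alpha + \hat{k}_h}{\hat{k}_h}
  = 1 + \frac{C_n}{\alpha \hat{k}_h}
  \le 1 + \frac{C_n}{\alpha m_n}
  = \frac{C_n/\alpha + m_n}{m_n},
% and therefore
\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{\hat{k}_h}
  = \frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C_n/\alpha + \hat{k}_h}
    \cdot \frac{C_n/\alpha + \hat{k}_h}{\hat{k}_h}
  \le \frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C_n/\alpha + \hat{k}_h}
    \cdot \frac{C_n/\alpha + m_n}{m_n}.
% Since C_n / m_n \to 0, the final factor tends to 1.
```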


4.3 Proof of Theorem 3 (asymptotic power calculation) 


We begin by stating a preliminary lemma on maximal values for random walks: 

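Lemma 5. Let $X_1, X_2, \dots$ be independent random variables with $\mathbb{E}[X_i] = 0$, and such that $X_i$ is
$(\sigma^2, b)$-subexponential (as defined in (11)) for all $i$. Then for any $\epsilon \in (0,1)$,

\[
\mathbb{P}\left\{ \left| \sum_{i=1}^t X_i \right| \le \sqrt{2 \log_2(4/\epsilon)} \cdot \max\left\{\sigma,\, b \sqrt{2 \log_2(4/\epsilon)}\right\} \cdot \sqrt{t \log(1 + t)} \ \text{ for all } t \ge 1 \right\} \ge 1 - \epsilon.
\]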


This lemma is proved in Appendix A.2. With this result in place, we turn to the proof of the theorem. 


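Proof of Theorem 3. First, we define the asymptotic expected FDP estimate along the sequence of
p-values,

\[
F(t) = 1 - f(t) \cdot (1 - \mu).
\]

That is, at the cutoff $k = t \cdot n$, we expect that $\widehat{\mathrm{FDP}}_h(k) \approx F(t)$. Note that $F(t)$ is monotone
nondecreasing, due to the assumption that $f(t)$ is nonincreasing.

Next, we prove that the approximation $\widehat{\mathrm{FDP}}_h(k) \approx F(t)$ is uniformly accurate. Fix any
$k \in \{1, \dots, n\}$. First we consider the expectation:

\[
\mathbb{E}\left[\widehat{\mathrm{FDP}}_h(k)\right]
= \mathbb{E}\left[\frac{\sum_{i=1}^k h(p_i)}{k}\right]
= \frac{k - \#\{i \le k : i \notin \mathcal{H}_0\} \cdot (1 - \mu)}{k}
= 1 - \frac{\#\{i \le k : i \notin \mathcal{H}_0\}}{k} \cdot (1 - \mu),
\]

and so, applying (12),

\[
\left| \mathbb{E}\left[\widehat{\mathrm{FDP}}_h(k)\right] - F\left(\frac{k}{n}\right) \right|
= \left| f\left(\frac{k}{n}\right) - \frac{\#\{i \le k : i \notin \mathcal{H}_0\}}{k} \right| \cdot (1 - \mu)
\le (1 - \mu)\,\epsilon_n. \tag{21}
\]

Next, we apply Lemma 5 to prove that
$\widehat{\mathrm{FDP}}_h(k) = \frac{\sum_{i=1}^k h(p_i)}{k} \approx \mathbb{E}\left[\frac{\sum_{i=1}^k h(p_i)}{k}\right]$ for all sufficiently
large $k$. Applying Lemma 5 with $\epsilon = \frac{1}{\log(n)}$, we see that with probability at least $1 - \frac{1}{\log(n)}$,

\[
\left| \frac{\sum_{i=1}^k h(p_i)}{k} - \frac{\sum_{i=1}^k \mathbb{E}[h(p_i)]}{k} \right|
\le \sqrt{2 \log_2(4 \log(n))} \cdot \max\left\{\sigma,\, b \sqrt{2 \log_2(4 \log(n))}\right\} \cdot \sqrt{\frac{\log(1 + k)}{k}}
\ \text{ for all } k \ge 1, \tag{22}
\]

and therefore, restricting our attention to $k \ge \log(n)$ and combining with our result in (21), we have

\[
\max_{k \ge \log(n)} \left| \widehat{\mathrm{FDP}}_h(k) - F\left(\frac{k}{n}\right) \right|
\le \sqrt{2 \log_2(4 \log(n))} \cdot \max\left\{\sigma,\, b \sqrt{2 \log_2(4 \log(n))}\right\} \cdot \sqrt{\frac{\log(1 + \log(n))}{\log(n)}}
+ (1 - \mu)\,\epsilon_n =: \beta_n. \tag{23}
\]

Note that $\beta_n \to 0$. We also define $\tau_n = \frac{\beta_n}{\delta (1 - \mu)}$, where the constant $\delta$ is from assumption (13), and
note that $\tau_n \to 0$ also.

Now we split into cases. Here we treat the case that $f(1) < \frac{1-\alpha}{1-\mu} < f(0)$, and defer the other two
cases to the Appendix.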


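Case 1: $\alpha$ satisfies $f(1) < \frac{1-\alpha}{1-\mu} < f(0)$. In this case, we will prove that, if (22) holds, then

\[
\begin{cases}
\widehat{\mathrm{FDP}}_h(k) \le \alpha \text{ for all } \log(n) \le k \le n \cdot (T - \tau_n), \text{ and} \\
\widehat{\mathrm{FDP}}_h(k) > \alpha \text{ for all } k \ge \max\{\log(n),\, n \cdot (T + \tau_n)\}.
\end{cases} \tag{24}
\]

If this holds, then by definition of $\hat{k}_h$, this implies that for sufficiently large $n$,

\[
n \cdot (T - \tau_n) \le \hat{k}_h \le n \cdot (T + \tau_n),
\]

and therefore,

\[
\frac{\#\{i \le n \cdot (T - \tau_n) : i \notin \mathcal{H}_0\}}{N_1(n)}
\le \mathrm{Power}(\hat{k}_h)
\le \frac{\#\{i \le n \cdot (T + \tau_n) : i \notin \mathcal{H}_0\}}{N_1(n)}.
\]

Using assumption (12), therefore,

\[
\frac{n \cdot (T - \tau_n) \cdot \left(f(T - \tau_n) - \epsilon_n\right)}{n \cdot \left(f(1) + \epsilon_n\right)}
\le \mathrm{Power}(\hat{k}_h)
\le \frac{n \cdot (T + \tau_n) \cdot \left(f(T + \tau_n) + \epsilon_n\right)}{n \cdot \left(f(1) - \epsilon_n\right)}. \tag{25}
\]

Since the limit of both sides is equal to $T \cdot \frac{f(T)}{f(1)}$, this proves the desired result. To be more precise,
we have proved that the bound (25) holds with probability at least $1 - \frac{1}{\log(n)}$ (since this is a lower
bound on the probability of the event (22)), which itself tends to 1. Therefore, the power of our
procedure converges to the limit $T \cdot \frac{f(T)}{f(1)}$ in probability.

It remains to be shown that (22) implies (24). First, note that $f(T) = \frac{1-\alpha}{1-\mu} > 1 - \alpha$, and so, since
$\tau_n \to 0$ and $f$ is continuous, we see that

\[
\min_{t \in [T - \tau_n,\, T + \tau_n]} f(t) > 1 - \alpha
\]

for sufficiently large $n$. Therefore, by assumption (13),

\[
f'(t) \le -\delta \text{ for all } t \in [T - \tau_n,\, T + \tau_n].
\]

Now take any $k$ such that $\log(n) \le k \le n \cdot (T - \tau_n)$. Then

\begin{align*}
\widehat{\mathrm{FDP}}_h(k) &\le F\left(\frac{k}{n}\right) + \beta_n \le F(T - \tau_n) + \beta_n = 1 - f(T - \tau_n) \cdot (1 - \mu) + \beta_n \\
&\le 1 - \left(f(T) + \tau_n \cdot \delta\right) \cdot (1 - \mu) + \beta_n = 1 - f(T) \cdot (1 - \mu) = \alpha,
\end{align*}

where the first inequality applies (23). This proves the first part of (24); the second part of (24) is
proved similarly. The remaining cases are proved in Appendix A.4.

□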


5 Simulations 

In this section we evaluate the performance of various accumulation tests on two tasks with
simulated data: a ranked hypothesis testing problem (Section 5.1), and a high-dimensional linear
regression problem (Section 5.2). Code to reproduce the first simulated data experiment is available
online.^

5.1 Simulations for the sequential testing problem 

Here we examine our accumulation test under four different simulation settings, and compare the
performance of several accumulation test methods: SeqStep with parameter $C = 2$, SeqStep+ with
$C = 2$, ForwardStop, and the new HingeExp method with parameter $C = 2$ (see Section 2 for
the definitions of these methods). The sequences of hypotheses and of p-values in our simulations
vary in the degree of separation between the nulls and the non-nulls (the extent to which non-null
hypotheses concentrate early in the list), and in the signal strength of the non-nulls (the extent to
which non-null p-values are visibly different from a uniform distribution).

5.1.1 Methods 

To create the simulated data, we generate the sequence of p-values for $n = 1000$ hypotheses with
100 non-nulls, by the following steps:

1. First, we generate “prior information” for each hypothesis. We draw z-scores $Z_i$ independently,
with $Z_i$ drawn from $N(0,1)$ for nulls $i \in \mathcal{H}_0$ and from $N(\mu_1, 1)$ for non-nulls $i \notin \mathcal{H}_0$.
Here $\mu_1 > 0$ controls the extent of the separation between nulls and non-nulls.

2. Sort the z-scores in descending order according to magnitude: $|Z_{(1)}| \ge |Z_{(2)}| \ge \dots \ge |Z_{(n)}|$.
Assign a new index to each hypothesis, according to its position in the sorted list.

3. Now we generate p-values for each hypothesis. We draw new z-scores $Z_i^*$ independently for
each hypothesis, with $Z_i^* \sim N(0,1)$ for nulls $i \in \mathcal{H}_0$ and $Z_i^* \sim N(\mu_2, 1)$ for non-nulls
$i \notin \mathcal{H}_0$. Here $\mu_2 > 0$ controls the strength of the true signals. Then we calculate p-values
with a two-tailed z-test, $p_i = 2(1 - \Phi(|Z_i^*|))$, where $\Phi$ is the cumulative distribution function
of the standard normal distribution. Note that these p-values are independent from the process
of ranking the hypotheses.

The ranking (and separation) of nulls and non-nulls is determined in steps 1 and 2, and controlled
by $\mu_1$. Larger $\mu_1$ leads to better separation between nulls and non-nulls. The strength of non-null
signals is specified in step 3, and controlled by $\mu_2$. As $\mu_2$ gets larger, the signals become stronger.
For settings with good separation of nulls and non-nulls and with strong signals, it is easier to
achieve high power while keeping FDR controlled. Here, we choose four settings, with two levels
of separation $\mu_1 \in \{2, 3\}$, and two levels of signal strength $\mu_2 \in \{2, 3\}$.
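The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' released code: the function name, the seed handling, and the choice to make the first `n_signal` indices non-null before sorting are all assumptions of this sketch.

```python
import random
from statistics import NormalDist
from typing import List, Tuple


def simulate_ordered_pvalues(n: int = 1000, n_signal: int = 100, mu1: float = 2.0,
                             mu2: float = 2.0, seed: int = 0) -> Tuple[List[float], List[bool]]:
    """Return (p-values, null indicators), both in the ranked order of the list."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    is_null = [i >= n_signal for i in range(n)]  # labels before ranking (arbitrary choice)

    # Step 1: prior z-scores, N(0, 1) for nulls and N(mu1, 1) for non-nulls.
    prior = [rng.gauss(0.0 if null else mu1, 1.0) for null in is_null]

    # Step 2: rank hypotheses by decreasing magnitude of the prior z-score.
    order = sorted(range(n), key=lambda i: -abs(prior[i]))

    # Step 3: fresh z-scores, independent of the ranking, turned into two-tailed p-values.
    pvals, nulls = [], []
    for i in order:
        z_star = rng.gauss(0.0 if is_null[i] else mu2, 1.0)
        pvals.append(2.0 * (1.0 - phi(abs(z_star))))
        nulls.append(is_null[i])
    return pvals, nulls
```

Because the p-values in step 3 are drawn after, and independently of, the ranking in steps 1 and 2, the null p-values remain uniform in their ranked positions.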

^http://www.stat.uchicago.edu/~rina/accumulationtests.html

Under each simulation setting, we compare the performance of the four selected accumulation test
methods. In each trial, the power and FDP for each accumulation function and rejection rule are
recorded, over a range of target FDR levels $\alpha \in \{0.05, 0.075, \dots, 0.25\}$. We also record the
estimated FDP in each trial as we move along the list of hypotheses, $\widehat{\mathrm{FDP}}_h(k)$, for each of the four
methods, and compare to the true false discovery proportion $\mathrm{FDP}(k)$, as $k$ ranges from 1 to $n$. The
performance results are averaged over 50 trials.

5.1.2 Results 

Figure 2 shows the average power and average observed FDR of the four selected accumulation test
methods, plotted against the target FDR level $\alpha$. In the weak signal regime ($\mu_2 = 2$), the methods
are mostly conservative in terms of FDR, with an observed FDR that is lower than the target level
$\alpha$, and thus power is low. This is expected, since the non-null p-values contribute to the estimated
false discovery proportion. In contrast, with strong signal ($\mu_2 = 3$), the observed FDR levels are
closer to $\alpha$, and in fact HingeExp has slightly higher FDR than desired when separation is poor
($\mu_1 = 2$). This is not unexpected, since HingeExp only guarantees the control of modified FDR
(see Lemma 1). The power of the methods improves with stronger signal (larger $\mu_2$) and with better
separation (larger $\mu_1$). Across all settings, HingeExp consistently gives the highest average power
and observed FDR, while SeqStep+ is generally the most conservative, with lowest average power
and observed FDR.

To further explore the differences between the methods, we also compare the estimated false
discovery proportion along the list of hypotheses, $\widehat{\mathrm{FDP}}_h(k)$ for $k = 1, \dots, n$, for each of the four methods,
and compare with the actual false discovery proportion, $\mathrm{FDP}(k)$. (For the SeqStep+ method we
define $\widehat{\mathrm{FDP}}_h(k)$ to agree with the method definition (4).) Figure 3 shows the results, averaged over
50 simulations. For settings with stronger signals (i.e. $\mu_2 = 3$), $\widehat{\mathrm{FDP}}_h(k)$ is a good estimate of
$\mathrm{FDP}(k)$, while for settings with weak signals (i.e. $\mu_2 = 2$), $\widehat{\mathrm{FDP}}_h(k)$ overestimates $\mathrm{FDP}(k)$, as
expected. Comparing across methods, the HingeExp function yields the estimate $\widehat{\mathrm{FDP}}_h(k)$
that best approximates the actual FDP.

5.2 Simulations for the least angle regression (LARS) path 

Inference for high dimensional regression has been a problem of wide interest in many modern
applications. Recently, Taylor et al. [37] proposed the spacing test method for post-selection inference
of LARS (closely related to the commonly used LASSO method for penalized sparse regression
[38]), which gives a p-value for each feature in the order of being selected in the LARS path. The
$i$th p-value, $p_i$, is distributed as $\mathrm{Unif}[0,1]$ under the null hypothesis that the partial regression
coefficient of the $i$th variable in the LARS path is 0, in the model containing all active variables at
the $i$th LARS step; furthermore, the p-values are independent. This provides an ordered list of
p-values that follows the assumptions of the ordered hypothesis testing problem, and therefore can be
treated with the accumulation method. (Note, however, that the null hypothesis refers to the partial
regression coefficient, while we may often be interested in the coefficient in the true model.) As the
sequence corresponds to signal and noisy features in the LARS path, the test provides a stopping
rule for LARS with a guarantee on the FDR level (here $\mathrm{FDP}(k)$ is the proportion of noise among all
features included up to the $k$th LARS step). In this simulation, we compare the performance of the
SeqStep (with parameter $C = 2$), SeqStep+ (with $C = 2$), ForwardStop, and HingeExp (with $C = 2$)
accumulation test methods, under three settings of feature signal strength.

[Figure 2 panels: Poor separation & weak signal ($\mu_1 = 2$, $\mu_2 = 2$); Good separation & weak signal ($\mu_1 = 3$, $\mu_2 = 2$); Poor separation & strong signal ($\mu_1 = 2$, $\mu_2 = 3$); Good separation & strong signal ($\mu_1 = 3$, $\mu_2 = 3$). Horizontal axis: target FDR.]

Figure 2: Power and observed FDR level of the SeqStep, SeqStep+, ForwardStop, and HingeExp methods,
plotted against target FDR level $\alpha$ (averaged over 50 trials).

[Figure 3 panels: Poor separation & weak signal ($\mu_1 = 2$, $\mu_2 = 2$); Poor separation & strong signal ($\mu_1 = 2$, $\mu_2 = 3$); Good separation & weak signal ($\mu_1 = 3$, $\mu_2 = 2$); Good separation & strong signal ($\mu_1 = 3$, $\mu_2 = 3$). Vertical axis: estimated FDP(k).]

Figure 3: Estimated FDP with the SeqStep, SeqStep+, ForwardStop, and HingeExp methods, plotted against
the true FDP, across $k = 1, \dots, p$ (results are averaged over 50 trials).

5.2.1 Methods

In all three simulation settings, there are $N = 200$ observations on $p = 100$ features, of which
either $k^* = 10$ or $k^* = 20$ are true signals (nonzero coefficients). The design matrix $X$ consists
of i.i.d. standard normal entries. The nonzero signals $\beta_j$, for features $j = 1, \dots, k^*$, are taken to
be equally spaced values from $2 \cdot \gamma$ to $\sqrt{2 \log(p)} \cdot \gamma$, where $\gamma = 1, 3, 5$ in the three settings. This
forms a gradient from weak signal to strong signal scenarios. The remaining coefficients are set as
$\beta_{k^*+1} = \dots = \beta_p = 0$. The response is then generated as

\[
y = X\beta + \epsilon,
\]

where the entries of $\epsilon$ are also i.i.d. standard normal variables. Given the simulated data $X, y$, the
LARS method and the spacing test are applied, yielding p-values for sequential testing. Note that
the number of hypotheses is now given by $p$ (one for each feature), rather than the former notation
$n$.
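The data-generating step of this design can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the signal endpoints $2\gamma$ and $\sqrt{2 \log(p)}\,\gamma$ follow the reading of the text above, and the LARS path and spacing test themselves are not implemented here.

```python
import math
import random
from typing import List, Tuple


def simulate_regression(n_obs: int = 200, p: int = 100, k_star: int = 10,
                        gamma: float = 1.0, seed: int = 0
                        ) -> Tuple[List[List[float]], List[float], List[float]]:
    """Return (X, y, beta) with i.i.d. N(0, 1) design and noise entries."""
    rng = random.Random(seed)
    lo = 2.0 * gamma
    hi = math.sqrt(2.0 * math.log(p)) * gamma
    # k_star equally spaced nonzero coefficients from lo to hi, followed by zeros.
    step = (hi - lo) / (k_star - 1) if k_star > 1 else 0.0
    beta = [lo + j * step for j in range(k_star)] + [0.0] * (p - k_star)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_obs)]
    # Response y = X beta + eps, computed row by row.
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y, beta
```

The resulting `X` and `y` would then be passed to a LARS implementation with the spacing test to obtain the ordered p-values.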


5.2.2 Results 

Figure 4 shows the average power and observed FDR of the four accumulation tests, averaged over
50 trials. When $\gamma = 1, 3$, all four methods successfully control FDR, and when $\gamma = 5$, SeqStep+
and ForwardStop control FDR well, while HingeExp and SeqStep slightly exceed the target FDR
level $\alpha$. In all settings, HingeExp attains the highest average power and observed FDR level, while
SeqStep+ is extremely conservative for lower $\alpha$ values (due to the fact that, with only $k^* = 10$ or
$k^* = 20$ true signals, few discoveries are made overall, so the slightly conservative correction in this
method (4) has a large effect).

We also plot the estimated false discovery proportion, $\widehat{\mathrm{FDP}}_h(k)$, over the first $k$ steps of the LARS
path ($k = 1, \dots, p$), against the actual $\mathrm{FDP}(k)$. Figure 5 shows the results, averaged over 50
simulations. As expected, the estimated FDP levels increasingly overestimate the true FDP as signal
strength $\gamma$ decreases. SeqStep+ is quite conservative due to the correction term in the method’s
definition, while the other three methods show no consistent trend in terms of accurate estimation of
$\mathrm{FDP}(k)$.

6 Application to dosage response data 

We now show an application of our methods to the problem of identifying effects of drug dosage on 
gene expression levels. Code to reproduce this real data experiment is available online.^ 

Suppose that we measure the following data: gene expression levels for genes $i = 1, \dots, n$ are
measured in $m = m_C + m_L + m_H$ independent trials, where the trials $\{1, \dots, m\}$ are partitioned
into three sets $T_C = \{1, \dots, m_C\}$, $T_L = \{m_C + 1, \dots, m_C + m_L\}$, and
$T_H = \{m_C + m_L + 1, \dots, m_C + m_L + m_H\}$, such that:

• For each $j \in T_C$, trial $j$ is carried out in the absence of the drug (the control group);

• For each $j \in T_L$, trial $j$ is carried out under a low drug dosage; and

• For each $j \in T_H$, trial $j$ is carried out under a high drug dosage.

^http://www.stat.uchicago.edu/~rina/accumulationtests.html

Figure 4: Power and observed FDR level of the SeqStep, SeqStep+, ForwardStop, and HingeExp methods for
the LARS path, plotted against target FDR level $\alpha$ (averaged over 50 trials).

[Figure 5 panels: $k^* = 20$ signals with signal strength $\gamma = 5$, $\gamma = 3$, $\gamma = 1$; $k^* = 10$ signals with signal strength $\gamma = 5$, $\gamma = 3$, $\gamma = 1$.]

Figure 5: Estimated FDP with the SeqStep, SeqStep+, ForwardStop, and HingeExp methods for the LARS path,
plotted against the true FDP, across $k = 1, \dots, p$ (results are averaged over 50 trials).

For each gene $i$ and each trial $j$, let $X_{ij} \in \mathbb{R}$ be the logarithm of the measured gene expression level
of gene $i$ in trial $j$.

In this type of setting, we may potentially be interested in various questions related to estimating the 
effect of dosage on gene expression level and testing for the significance of these effects. In general, 
identifying genes that respond to the higher dosage will be easier than at the lower dosage, since the 
magnitude of the response will often depend on the dosage used. To be specific, for each gene i we 
are interested in testing the null hypothesis Hi, which states that the observations 

\[
\{X_{ij} : j \in T_C \cup T_L\}
\]

are i.i.d. (in other words, the low dose has no effect on the distribution for the gene expression level 
for gene i). For simplicity in the discussion and analysis below, we treat the measurements for each 
gene as though they were independent—of course, this does not hold in practice, and in future work 
we hope to develop results on FDR control of the accumulation tests under p-value dependence. For 
the present experiment, each of the methods we compare here comes with theoretical guarantees of 
FDR control only under the independence assumption.^ 

In order to identify which genes are differentially expressed at the low dosage level (as compared to 
the control), we can take several different approaches: 

• Two-sample approach. First, we can take the simple approach of comparing only the control
data and the low dosage data, while discarding the high dosage data. For each gene $i$, we
would calculate a p-value $p_i$ by comparing the control group observations $\{X_{ij} : j \in T_C\}$
with the low dosage observations $\{X_{ij} : j \in T_L\}$. To control FDR, we could then apply
the Benjamini-Hochberg procedure [3]. This approach has the drawback that it makes no use
of the valuable information in the high dosage trials; in our experiment, we show that this
approach has relatively low power.

• Joint model approach. At the other extreme, we could construct a joint model for gene
responses at each of the dosage levels, and fit this model to the entire data set. Using the entire
data set, including the high dosage data, would increase our power to detect differential
responses at the low dosage since information is shared across these two conditions. However,
the validity of the inference we perform could be extremely sensitive to the choice of the
model.

• Sequential testing approach. Finally, we may apply a sequential testing procedure, which 
combines the benefits of the two approaches above. In our experiment, we will see that this 
approach will control FDR, while gaining substantial power as compared to the first approach 
where the high dosage data is discarded. 

For the remainder of this section, for two sets of observations $A$ and $B$, define $\mathrm{Pval}(A, B)$ to be the
p-value produced by a two-sided two-sample t-test comparing the observations in $A$ with the
observations in $B$ (assuming unequal variances between the two populations). Define $\mathrm{Pval}_+(A, B)$ and
$\mathrm{Pval}_-(A, B)$ analogously for one-sided two-sample t-tests, where $\mathrm{Pval}_+(A, B)$ tests for evidence
that the mean of $A$’s population is larger than the mean of $B$’s population, and $\mathrm{Pval}_-(A, B)$ does
the reverse.

^ While several existing methods for multiple testing yield guarantees for FDR control even in the case of dependent
p-values, the methods we are aware of are quite conservative and yield nearly zero power in this experiment. Specifically,
Benjamini and Yekutieli’s modification [5] of the BH method, the Holm-Bonferroni method [25], and the Bonferroni
correction each yielded, even at target FDR level $\alpha = 0.9$, no more than two discoveries for this gene expression experiment; in
contrast, the accumulation test methods examined here yield thousands of discoveries.

We now follow these steps to reformulate the dosage/gene expression problem as a sequential
hypothesis testing problem:

1. For each gene i, calculate as 

= Pval ({X,, : j e Th}, {X,, : j e Tc U rj) . 

Record also Si G {-f, —}, the sign of the estimated effect, i.e. 


Si = sign 



jeTH 


1 

me + m\_ 


jeTcUTL 


2 . 


We then use these high-dosage p-values to relabel the n genes, that is, reorder the genes so 
that 

^high < ^high < ... < ^Hgh ^ 


3. Next, for each gene i, compute an initial p-value by comparing the low-dosage trials with the 
control trials. We use a one-sided two-sample t-test (determined by the sign Si): 

pr* = PvaU. {{X,, : j e Tl}, {X,, : j G Tc}) . 

The reason that we use a one-sided t-test (rather than two-sided), is that, if for instance we 
observe a positive response at the high dosage for gene i, then we are much more likely to see 
a positive (rather than negative) effect at the low dosage, as well. Therefore, using a one-sided 
t-test is likely to achieve higher power than a two-sided test. 


4. Now we transform to the final p-values using a permutation test. For each gene z, for each 
possible permutation tt on the trial labels me + mi}, compute 


= PvaU- : j G Ti], : j G Tc}) . 

We then calculate the final p-value by comparing with the empirical distribution {p}’ : 
all possible permutations tt}: 


= 




. „init 


<Pi} 


{me + wl)! 


(26) 


and perform the accumulation test on this sequence of p-values. 


In fact, since $p_i^{\pi}$ depends only on the partition of the $m_C + m_L$ trial labels into two groups of size $m_C$ and $m_L$ (the control group and the low-dose group), we only need to calculate $p_i^{\pi}$ for $\binom{m_C + m_L}{m_C}$ permutations, one for each such partition. 

Note that $p_i^{\mathrm{high}}$ and $s_i$ depend on $\{X_{ij} : j \in T_C \cup T_L\}$, and so the reordering is not independent of this data. Therefore, we cannot use the t-test p-values $p_i^{\mathrm{init}}$ for the accumulation test. However, $p_i^{\mathrm{high}}$ and $s_i$ are invariant to permutations of this input by definition of the two-sample t-test. In other words, even after conditioning on $(p_i^{\mathrm{high}}, s_i)$, the variables $\{X_{ij} : j \in T_C \cup T_L\}$ are exchangeable under the null hypothesis $H_i$. Therefore, even after reordering the genes according to the high-dosage p-values and recording signs $s_i$, the final permutation test p-values $p_i$ that we calculate are valid p-values for each true null hypothesis $H_i$. Our theory thus guarantees FDR control when the accumulation test is applied to these permutation test p-values (if we assume that the data for each gene is independent). 
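
The permutation step for a single gene can be sketched in Python as follows. This is a minimal illustration rather than the code used for the experiment: the function names are hypothetical, the t statistic is the pooled-variance version, and the sketch compares signed t statistics directly instead of t-test p-values, assuming that the one-sided p-value is a decreasing function of the signed statistic (so the two comparisons agree). It enumerates one relabeling per partition of the pooled trials, as described above.

```python
from itertools import combinations
from math import comb, sqrt


def t_stat(a, b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / sqrt(pooled_var * (1 / na + 1 / nb))


def permutation_pvalue(low, control, sign):
    """Permutation p-value for one gene; sign is +1 or -1 (hypothetical helper).

    Counts the partitions of the pooled trials whose signed t statistic is at
    least as extreme as the observed one, then divides by the number of
    partitions (equivalent to dividing the permutation count by the factorial).
    """
    pooled = list(low) + list(control)
    m_l = len(low)
    observed = sign * t_stat(low, control)
    indices = range(len(pooled))
    count = 0
    for chosen in combinations(indices, m_l):
        chosen_set = set(chosen)
        grp_low = [pooled[j] for j in indices if j in chosen_set]
        grp_ctrl = [pooled[j] for j in indices if j not in chosen_set]
        # Small tolerance guards against floating-point ties.
        if sign * t_stat(grp_low, grp_ctrl) >= observed - 1e-12:
            count += 1
    return count / comb(len(pooled), m_l)
```

With five low-dosage and five control trials the loop visits 252 partitions, so the smallest attainable p-value is 1/252. The sketch assumes the data are not constant within both groups, since the pooled variance would then be zero.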
6.1 Empirical results 

We now implement the methods described above on real data. All computations are carried out in 
R [29]. 

The data^ [10] measures differential expression in response to estrogen in breast cancer cells. The 
data set consists of $n = 22283$ genes with 25 trials, with 5 trials each for the control group and for 
four different dosage levels. For our experiment, we use the $m_C = 5$ control trials, the $m_L = 5$ trials 
at the lowest dosage, and the $m_H = 5$ trials at the highest dosage. 

We compare the following methods, each with target FDR level $\alpha = 0, 0.01, 0.02, \dots, 0.90$: 

• The accumulation test applied to the sequence of p-values $p_i$ from (26), which are permutation test p-values built on one-sided t-tests, using one of the following accumulation functions: 

$$h(t) = 2 \cdot \mathbb{1}\{t > 0.5\}, \qquad h(t) = 2\log\left(\frac{1}{2(1-t)}\right) \cdot \mathbb{1}\{t > 0.5\}, \qquad h(t) = \log\left(\frac{1}{1-t}\right),$$ 

corresponding to the SeqStep method with $C = 2$, the HingeExp function with $C = 2$, and the ForwardStop method®, respectively. We also test SeqStep+ (recall (4)) with $C = 2$. 

• The Benjamini-Hochberg procedure [3], using p-values that compare the low dosage trials with the control trials, via either a two-sided t-test, 

$$p_i^{t} = \mathrm{Pval}\left(\{X_{ij} : j \in T_L\},\ \{X_{ij} : j \in T_C\}\right), \qquad (27)$$ 

or a permutation test on these t-tests, 

$$p_i^{\mathrm{perm}} = \frac{\#\left\{\pi : \mathrm{Pval}\left(\{X_{i\pi(j)} : j \in T_L\},\ \{X_{i\pi(j)} : j \in T_C\}\right) \le p_i^{t}\right\}}{(m_C + m_L)!}. \qquad (28)$$ 

• Storey [34]'s modification of the Benjamini-Hochberg procedure, applied to either the t-test p-values (27) or the permutation test p-values (28). Since the value of $\alpha$ ranges from 0 to 0.9, we estimate the number of true nulls as $\hat{n}_0 = 10 \cdot \#\{i : p_i^{t} > 0.9\}$ or $\hat{n}_0 = 10 \cdot \#\{i : p_i^{\mathrm{perm}} > 0.9\}$, respectively, for the two types of p-values. 
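
The accumulation functions in the first item above can be sketched in Python as follows. This is a minimal illustration under two assumptions: the cutoff rule is taken to be the largest $k$ for which the running average of $h(p_1), \dots, h(p_k)$ is at most $\alpha$, and all p-values are strictly below 1 so that the logarithms are finite. The function names are hypothetical.

```python
import math


def seqstep(t, C=2.0):
    """SeqStep accumulation function: C on (1 - 1/C, 1], zero elsewhere."""
    return C if t > 1 - 1 / C else 0.0


def hinge_exp(t, C=2.0):
    """HingeExp accumulation function with parameter C (requires t < 1)."""
    return C * math.log(1 / (C * (1 - t))) if t > 1 - 1 / C else 0.0


def forward_stop(t):
    """ForwardStop accumulation function (requires t < 1)."""
    return math.log(1 / (1 - t))


def accumulation_test(pvals, h, alpha):
    """Return the largest k whose running average of h(p_i) is <= alpha (0 if none)."""
    total = 0.0
    k_hat = 0
    for k, p in enumerate(pvals, start=1):
        total += h(p)
        if total / k <= alpha:
            k_hat = k
    return k_hat
```

The p-values are passed in their ranked order, and the first `k_hat` hypotheses are the ones declared significant.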

For the various methods, Figure 6 displays the number of discoveries against the target FDR level $\alpha$. 
We see that the accumulation tests far outperform the Benjamini-Hochberg and Storey procedures. 
At target FDR levels $\alpha < 0.5$, the Benjamini-Hochberg and Storey methods are unable to make 
more than a few discoveries, while the accumulation tests produce many discoveries. 
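
For reference, the two baseline procedures can be sketched in Python as follows. This is a minimal illustration of the standard Benjamini-Hochberg step-up rule and of a Storey-type null count estimate with threshold 0.9; the function names are hypothetical, and the guard against an estimate of zero is an assumption added here to avoid division by zero.

```python
def storey_n0(pvals, lam=0.9):
    """Storey-type estimate of the number of true nulls: #{p_i > lam} / (1 - lam)."""
    return sum(p > lam for p in pvals) / (1 - lam)


def bh_discoveries(pvals, alpha, n0=None):
    """Number of step-up rejections at level alpha.

    With n0=None this is the Benjamini-Hochberg rule (denominator n);
    passing an estimate n0 gives the Storey-type modification.
    """
    denom = len(pvals) if n0 is None else max(n0, 1.0)  # guard is an assumption
    k_hat = 0
    for k, p in enumerate(sorted(pvals), start=1):
        if p <= alpha * k / denom:
            k_hat = k
    return k_hat
```

When the estimate `n0` is smaller than the number of hypotheses, the thresholds become more lenient, which is the source of the extra power of the Storey modification.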

Comparing the accumulation tests that are studied here, the SeqStep and SeqStep+ procedures are 
almost identical, showing that when the number of discoveries is high, the slight correction in the 
definition of the SeqStep+ method (4) incurs essentially no loss of power relative to SeqStep. The 
HingeExp method yields substantially more discoveries than SeqStep and SeqStep+, which in turn 
yield more discoveries than ForwardStop. 

^Data available at http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2324 or via the 
GEOquery package [11] in R. 

®For the ForwardStop and HingeExp methods, since $\binom{10}{5} = 252$, our permutation test p-values take values in the set 
$\{\frac{1}{252}, \frac{2}{252}, \dots, \frac{252}{252}\}$. The possibility of $p_i = 1$ for some genes $i$ is problematic because $h(1) = +\infty$ for these methods. 
Therefore we shift the p-values slightly so that none of them is equal to 1. Although technically this may violate the FDR 
control properties of the methods, the shift is extremely small and should not cause issues. 
Figure 6: Results for the differential gene expression experiment: for each method, the plot shows the number 
of discoveries made (i.e. the number of genes selected as showing significant change in expression at the low 
drug dosage), at a range of target FDR values $\alpha$ (horizontal axis: target FDR level $\alpha$). Note that the SeqStep 
and SeqStep+ methods are nearly indistinguishable in the plot. 


Overall, the comparisons across the different accumulation tests examined here confirm the higher 
power attained by HingeExp compared to existing methods in the family. We also see a substantial 
power gain of the accumulation tests as compared to the Benjamini-Hochberg and Storey procedures, 
neither of which uses a sequential structure or the high-dosage data, 
demonstrating the benefits of the ordered hypothesis testing approach. 


7 Conclusion 

In this paper, we have considered the multiple testing problem in the context of prior information or 
structure on the list of hypotheses $H_1, \dots, H_n$ to be tested. The proposed family of accumulation 
tests generalizes existing methods for this ordered testing problem, and the new HingeExp method 
within this family gives significantly higher power than the existing tests, while maintaining control 
of the modified false discovery rate in a finite-sample setting, and asymptotic control of the false 
discovery rate. Our theoretical results prove FDR control for methods in this family in general, and 
examine the power properties of the tests within this family. These methods are a natural fit for 
any multiple testing problem where there is an inherent ordering to the hypotheses, but many other 
settings can be framed in this way as well. Our real data experiment, which uses measurements of 
gene expression level across a gradient of drug dosages, indicates that we can achieve strong power 
gains by framing the problem as an ordered hypothesis testing problem. 

In general, existing structure within the multiple testing problem can substantially increase our 
power to detect signals while maintaining low false positives. In addition to the ordered testing 
setting considered here, real problems can exhibit many types of structure: for example, a hierarchical 
structure (represented as a tree) on the set of hypotheses, a grouped structure, or prior information 
that gives a prior distribution for the status (true signal or null and/or strength of the signal) of each 
hypothesis. These types of multiple testing problems offer more information or a different type of 
structure relative to the ordered testing problem considered here. Existing literature offers some 
methods for several of these settings. It would be of interest to examine whether a general framework 
can encompass all of these possible structures for the multiple testing problem, to offer a single 
unifying approach that is flexible enough to incorporate any type of informative structure in order to 
discover as many signals as possible. 
A Appendix: proofs 

A.1 Details for the proof of Theorem 1 

In this section we show details for deriving the results of Theorem 1 from the key lemma, Lemma 3. 
These calculations essentially follow the proof of [1, Theorem 3] but we include them here for 
completeness. 

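Proof of Theorem 1. First, note that the result (9) for bounded accumulation functions is simply a 
special case of the general result (10), since if the accumulation function $h$ is bounded by $C$ then 

$$\int_{t=0}^{1} h(t) \wedge C \, dt = \int_{t=0}^{1} h(t) \, dt = 1.$$ 

Therefore it suffices to prove (10). For the bound in (10) that treats $\mathrm{FDP}(\hat{k}_h^{+C})$, we have 

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{FDP}(\hat{k}_h^{+C})\right]
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h^{+C} : i \in \mathcal{H}_0\}}{\hat{k}_h^{+C}} \cdot \mathbb{1}\{\hat{k}_h^{+C} > 0\}\right] \\
&\le \mathbb{E}\left[\frac{1 + \#\{i \le \hat{k}_h^{+C} : i \in \mathcal{H}_0\}}{1 + \hat{k}_h^{+C}}\right] \\
&= \mathbb{E}\left[\frac{1 + \#\{i \le \hat{k}_h^{+C} : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}_h^{+C}} h(p_i)} \cdot \frac{C + \sum_{i=1}^{\hat{k}_h^{+C}} h(p_i)}{1 + \hat{k}_h^{+C}}\right] \\
&\le \alpha \cdot \mathbb{E}\left[\frac{1 + \#\{i \le \hat{k}_h^{+C} : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}_h^{+C}} h(p_i)}\right] \quad \text{by definition of } \hat{k}_h^{+C} \\
&\le \alpha \cdot \frac{1}{\int_{t=0}^{1} h(t) \wedge C \, dt} \quad \text{by Lemma 3.}
\end{aligned}
$$ 

Turning to the bound in (10) that treats $\mathrm{mFDP}_{C/\alpha}(\hat{k}_h)$, we have 

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{mFDP}_{C/\alpha}(\hat{k}_h)\right]
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C/\alpha + \hat{k}_h}\right] \\
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}_h} h(p_i)} \cdot \frac{C + \sum_{i=1}^{\hat{k}_h} h(p_i)}{C/\alpha + \hat{k}_h}\right] \\
&\le \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}_h} h(p_i)} \cdot \frac{C + \hat{k}_h \cdot \alpha}{C/\alpha + \hat{k}_h}\right] \quad \text{by definition of } \hat{k}_h \\
&= \alpha \cdot \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C + \sum_{i=1}^{\hat{k}_h} h(p_i)}\right] \\
&\le \alpha \cdot \frac{1}{\int_{t=0}^{1} h(t) \wedge C \, dt} \quad \text{by Lemma 3.}
\end{aligned}
$$ 

□ 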
A.2 Proof of Lemma 5 (bounds on random walks) 

Proof of Lemma 5. First, define 

a = max | a, 5V21og2 (4/.)} . 

Since a > a, trivially each Xi is {a, 5)-subexponential. Then for all 9 € [0, and all f > 1, 

r n2~2 

E < exp ■ 




Now taking any f > 1, and any r > 0 such that 9 = 


> r i =pi6»^X, > 9r \ < > e^’'| 


< E 

< exp ( 


i=l 

2;;.2 


f 19"^ a 


I 2 


• e (Markov inequality) 


= exp < - 


2ta^ } 


by taking 9 = . 


By an identical argument, the same bound holds for P|X]i=i^* — therefore, for 


all 


f > 1 and r < 




> r > < 2 exp < — 


J 


Set r = -^2 log 2 (^/e) • ct • ■\/f log(l + f). Since -^/f log(l + f) < t for all f > 1, note that the upper 
bound on r is satisfied by definition of a. Then we get 




> V 21 og 2 m ■ a ■ ^tlogil + t) < 2 exp {- log 2 m ■ log(l + t)} = 2{l+t)- 


Then we have 


p E^* 

(i=i 

^i-E 


< all t > 1 


t>i 




> \/ 21 og 2 (V0^\/^los(l + 0 


_ poo 

> 1 - ^ 2(1 + t )-> 1 - / 2f- df 

t>l ■Jt=2 
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Ul-l0g2(4/e)] “ 

1 + 2^^-= 1- 


= 1 + 


l0g2(V«^)-l l0g2(f)-l 


> 1 - e . 


□ 
A.3 Proof of Lemma 2 (bounded accumulation functions) 

Proof of Lemma 2. We have 

$$
\begin{aligned}
\mathbb{E}[h(p_i)] - \mathbb{E}[h_0(p_i)]
&= \int_{t=0}^{1} \left(h(t) - h_0(t)\right) \cdot f_i(t) \, dt \\
&= \int_{t=0}^{1-1/C} \left(h(t) - 0\right) \cdot f_i(t) \, dt + \int_{t=1-1/C}^{1} \left(h(t) - C\right) \cdot f_i(t) \, dt \\
&= \int_{t=0}^{1-1/C} h(t) \cdot f_i(t) \, dt - \int_{t=1-1/C}^{1} \left(C - h(t)\right) \cdot f_i(t) \, dt \\
&\ge \int_{t=0}^{1-1/C} h(t) \cdot f_i(1 - 1/C) \, dt - \int_{t=1-1/C}^{1} \left(C - h(t)\right) \cdot f_i(1 - 1/C) \, dt \\
&= f_i(1 - 1/C) \cdot \left[\int_{t=0}^{1} h(t) \, dt - \int_{t=1-1/C}^{1} C \, dt\right] \\
&= f_i(1 - 1/C) \cdot [1 - 1] = 0,
\end{aligned}
$$ 

where the inequality is true since $f_i$ is nonincreasing and since $h(t) \ge 0$ and $C - h(t) \ge 0$. 
Furthermore, if the inequality is not strict (i.e. is an equality), then we must have 

$$\int_{t=0}^{1-1/C} h(t) \cdot \left[f_i(t) - f_i(1 - 1/C)\right] dt = 0 \quad \text{and} \quad \int_{t=1-1/C}^{1} \left(C - h(t)\right) \cdot \left[f_i(1 - 1/C) - f_i(t)\right] dt = 0.$$ 

Note that, in both integrals, the integrand is nonnegative. Therefore, in order for the integrals to 
equal zero, it must be true that the integrands are equal to zero almost everywhere. However, if $f_i$ 
is strictly decreasing then the terms in square brackets are strictly positive in both integrals (except 
at endpoints). Therefore, in the first integral we must have $h(t) = 0$ almost everywhere over 
$t \in (0, 1 - 1/C)$, and in the second integral we must have $C - h(t) = 0$ almost everywhere over 
$t \in (1 - 1/C, 1)$. In other words, $h(t) = h_0(t)$ almost everywhere over $t \in [0, 1]$. □ 
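
The inequality in Lemma 2 can be checked numerically on a small example. The sketch below is illustrative only: the density $f(t) = 2(1 - t)$ and the competing accumulation function $h(t) = 2t$ are assumptions chosen here (both satisfy the conditions used in the proof, with $C = 2$), and the integrals are approximated by a midpoint rule.

```python
def integrate(g, n=100_000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    return sum(g((j + 0.5) / n) for j in range(n)) / n


C = 2.0


def f(t):
    """Assumed nonincreasing density of a non-null p-value."""
    return 2.0 * (1.0 - t)


def h0(t):
    """Step function: C on (1 - 1/C, 1], zero elsewhere."""
    return C if t > 1 - 1 / C else 0.0


def h(t):
    """Another accumulation function bounded by C that integrates to 1."""
    return 2.0 * t


e_h = integrate(lambda t: h(t) * f(t))    # approximates E[h(p)]
e_h0 = integrate(lambda t: h0(t) * f(t))  # approximates E[h0(p)]
```

For these choices the exact values are 2/3 for `e_h` and 1/2 for `e_h0`, in line with the lemma's conclusion that the step function gives the smaller expectation under a nonincreasing density.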


A.4 Details for proof of Theorem 3 

Proof of Theorem 3 (continued). Here we fill in the remaining details for the proof of Theorem 3, 
namely, we consider the cases that >/(0)or^</(l). 

Case 2: a satisfies ^ /(O) Por diis case, we will show that power tends to zero. As in the 
first case, it will be sufficient to show that 

FDPh(A:) > a for all k > max{T„ • n, log(n)} . (29) 

To prove this, take any such k. Then, if the event (22) holds, we apply (23) to get 

FDPh(fc) > E > E(t„) - /?„ = 1 - /(t„) • (1 - /r) - /3„ . 

If /(O) = iCl /(O) > 1 — a and so /(t„) >1 — a for sufficiently large n. Therefore, 

applying assumption (13), /(t„) < /(O) — t „(5 and then 

FDPh(fc) > 1 - (/(O) - TnS) • (1 - /r) - /3„ = a . 
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Alternately, if /(O) < iEf, then since / is continuous, for sufficiently large n we have /(t„) < 

i-a-/3„ same bound holds. 

l-M 

Case 3: a satisfies < /(I) For this case, we will show that power tends to 1. As in the 
previous cases, it will be sufficient to show that 

FDPf,(fc) < a for all fc < n • (1 — r„) . 

To prove this, take any such k. Then, if the event (22) holds, we apply (23) to get 

FDPh(fc) < E + /3„ < E(1 - T„) + /3„ = 1 - /(I - T„) ■ {1- fi) + /3n ■ 

Since /(I) > > 1 — a, we see that f(t) > 1 — a for all t € [1 — Tn, 1], and so /(I — r„) > 

/(I) + TnS by assumption (13). Then, 

FDPh(fc) < 1 - (/(I) + Tn6) • (1 - /r) +/3„ < a . 

□ 

A.5 Proof of Lemma 1 (FDR control for the HingeExp function) 

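Proof of Lemma 1. First, by Jensen's inequality, drawing $E_{0,1}, E_{0,2} \overset{\mathrm{iid}}{\sim} \mathrm{Exponential}(1)$ independently 
from the p-values $p_1, \dots, p_n$, 

$$
\begin{aligned}
\mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{2C\alpha^{-1} + \hat{k}_h}\right]
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C\alpha^{-1}\,\mathbb{E}\left[E_{0,1} + E_{0,2} \mid p_1, \dots, p_n\right] + \hat{k}_h}\right] \\
&\le \mathbb{E}\left[\mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h} \,\middle|\, p_1, \dots, p_n\right]\right]
= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right].
\end{aligned}
$$ 

Next, 

$$
\begin{aligned}
\mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right]
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)} \cdot \frac{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right] \\
&\le \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)} \cdot \frac{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h} h(p_i)}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right] \\
&= \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)} \cdot \frac{C(E_{0,1} + E_{0,2}) + \hat{k}_h \cdot \widehat{\mathrm{FDP}}_h(\hat{k}_h)}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right] \\
&\le \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)} \cdot \frac{C(E_{0,1} + E_{0,2}) + \hat{k}_h \cdot \alpha}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right] \\
&= \alpha \cdot \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C(E_{0,1} + E_{0,2}) + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} h(p_i)}\right].
\end{aligned}
$$ 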


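Next, note that $h(p_i)$ is equal in distribution to $C \cdot B_i \cdot E_i$, where $B_i \sim \mathrm{Bernoulli}(1/C)$ and $E_i \sim \mathrm{Exponential}(1)$, for all $i \in \mathcal{H}_0$. (Here we assume that the variables $B_i$ and $E_i$ are all mutually 
independent.) Therefore we have 

$$\mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{C\alpha^{-1}(E_{0,1} + E_{0,2}) + \hat{k}_h}\right] \le \frac{\alpha}{C} \cdot \mathbb{E}\left[\frac{\#\{i \le \hat{k}_h : i \in \mathcal{H}_0\}}{E_{0,1} + E_{0,2} + \sum_{i \le \hat{k}_h, i \in \mathcal{H}_0} B_i \cdot E_i}\right] \le \frac{\alpha}{C} \cdot \mathbb{E}\left[M_{\hat{k}_h}\right],$$ 

where we define 

$$M_k = \frac{1 + \#\{i \le k : i \in \mathcal{H}_0\}}{E_{0,1} + E_{0,2} + \sum_{i \le k, i \in \mathcal{H}_0} B_i \cdot E_i}.$$ 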

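Next, we prove that $M_k$ is a supermartingale with $\mathbb{E}[M_n] \le C$, and that $\hat{k}_h$ is a stopping time. 

Let $\mathcal{F}_k$ be the $\sigma$-algebra defined by knowing $E_{0,1}, E_{0,2}$, knowing $(B_i, E_i)$ for all $i \notin \mathcal{H}_0$, knowing 
$(B_i, E_i)$ for all $i > k$ with $i \in \mathcal{H}_0$, and knowing $\{(B_i, E_i) : i \le k, i \in \mathcal{H}_0\}$ (note that this is an 
unordered set, as before in e.g. the proof of Lemma 3). Let $\tilde{\mathcal{F}}_k$ be the $\sigma$-algebra that additionally 
knows $B_1, \dots, B_n$. 

Now we show that $\mathbb{E}[M_k \mid \mathcal{F}_{k+1}] \le M_{k+1}$. If $k + 1 \notin \mathcal{H}_0$ or if $B_{k+1} = 0$, then $M_k \le M_{k+1}$ 
trivially. Turning to the case where $k + 1 \in \mathcal{H}_0$ and $B_{k+1} = 1$, we begin by conditioning on 
$B_1, \dots, B_n$. In that case, we see that 

$$E_{0,1} + E_{0,2} + \sum_{i \le k, i \in \mathcal{H}_0} B_i \cdot E_i$$ 

is a sum of $\left(2 + \sum_{i \le k, i \in \mathcal{H}_0} B_i\right)$ many Exponential(1) variables, which is distributed as $\mathrm{Gamma}\left(2 + \sum_{i \le k, i \in \mathcal{H}_0} B_i, 1\right)$, while 

$$B_{k+1} \cdot E_{k+1}$$ 

is equal to another (independent) Exponential(1) variable, which is distributed as $\mathrm{Gamma}(1, 1)$. We now use the following fact: 

If $X \sim \mathrm{Gamma}(k, 1)$, $Y \sim \mathrm{Gamma}(l, 1)$, and $X \perp Y$, then 1) $\frac{X}{X + Y} \sim \mathrm{Beta}(k, l)$, and 2) $\frac{X}{X + Y} \perp X + Y$. 

Conditioning on $B_1, \dots, B_n$, this gives 

$$\frac{E_{0,1} + E_{0,2} + \sum_{i \le k, i \in \mathcal{H}_0} B_i \cdot E_i}{E_{0,1} + E_{0,2} + \sum_{i \le k+1, i \in \mathcal{H}_0} B_i \cdot E_i} \sim \mathrm{Beta}\left(2 + \sum_{i \le k, i \in \mathcal{H}_0} B_i,\ 1\right),$$ 

and conditioning on $\tilde{\mathcal{F}}_{k+1}$ yields the same result. We next use a second fact: 

If $X \sim \mathrm{Beta}(a, b)$ and $a > 1$, then $\mathbb{E}\left[\frac{1}{X}\right] = \frac{a + b - 1}{a - 1}$. 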

Therefore, 


E 


We then have 


£0,1 + £0,2 + 




B, ■ Ei 


£ 


0.1 


■E, 


0,2 


E 


i<k,i£'HQ 


Bi ■ Ej 


£ 


fc+i 


i + E 


i<k-\-l,i£'HQ 


B, 


i + E 


%<k,te'Ho 


E [Mk I -£fc+i] 

1 + #{* < k : i G Ho} 


= E 


= E 


= E 


£( 


0.1 -I- £0.2 + Ei<fc,ieWo 
1 + #{* < k : i G Ho} 


Bj ■ Ej 


Ek+i 


£( 


0,1 


£( 


0,2 


■E 


i<k-\-l,i^'HQ 


Bj ■ Ej 


£0,1 + £0,2 + Ei<fc+i,ieWo + Ei<fc,ieWo 


S, • Ej 


El 


fc+i 


1 + #{* < k : i G Ho} 


E, 


0,1 


£ 


0,2' 


E, 


i<k+l,ie'Ho * 


Bj ■ Ej 


•E 


£ 


0,1 


■ £0,2 + E 


i<k-\-l,i^'H[ 


B, ■ £ 


£ 


0,1 


•£ 


0,2 


■E, 


i<k,ie'Ho * 


£, • £ 


El 


fc+i 


since here /c + 1 € Ho, #{* < k : i G Ho} = #{t < fc + 1 : i € £ 0 } — 1 is known, given Ek+i 


= E 


1 + #{* < k : i G Ho} 


1 + Ej 


<fe+i,jeWo 


B, 


£ 0,1 + £ 0,2 + Ei 

1 + Ej<fc+l,i6Ho 


z<A;+1,zG'^o 


£0,1 + £0,2 + Ei</c+i,ieWo 


£ • Ej 

•E 


1 + E 


i^k^i^Tio 


B, 


Ek+i 


1 + #{* < k : i G Ho} 

Ti , 1 

1 + Ei<fc,i6Wo 



< 


1 + E 


j<fc+i,ieWo 


1 + < fc + 1 : i € £ 0 } 


£0,1 + £0,2 + Ei</c+i,ieWo 1 + Ei<fc+i.ieWo 

1 + #{* < A: + 1 : t G £ 0 } 

£0,1 + £0,2 + Ei</c+i,ieWo 


= Ml 


fc+i ) 


-£fc+i 


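where the inequality in the next-to-last step comes from Lemma 4. This proves that $M_k$ is a supermartingale. Finally, we have 

$$
\begin{aligned}
\mathbb{E}[M_n]
&= \mathbb{E}\left[\frac{1 + |\mathcal{H}_0|}{E_{0,1} + E_{0,2} + \sum_{i \in \mathcal{H}_0} B_i \cdot E_i}\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\frac{1 + |\mathcal{H}_0|}{E_{0,1} + E_{0,2} + \sum_{i \in \mathcal{H}_0} B_i \cdot E_i} \,\middle|\, B_1, \dots, B_n\right]\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\frac{1 + |\mathcal{H}_0|}{\mathrm{Gamma}\left(2 + \sum_{i \in \mathcal{H}_0} B_i, 1\right)} \,\middle|\, \sum_{i \in \mathcal{H}_0} B_i\right]\right] \\
&= \mathbb{E}\left[\frac{1 + |\mathcal{H}_0|}{\left(2 + \sum_{i \in \mathcal{H}_0} B_i\right) - 1}\right] \\
&\le C,
\end{aligned}
$$ 

where again we apply Lemma 4 for the last step. 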


□ 



